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1  Project  Summary 

Botnets  have  become  a  leading  cyber  threat  because  the  attackers  can  use  Command  & 
Control  (C&C)  to  launch  sophisticated  attacks.  Despite  many  studies  dealing  with  botnets, 
there  is  a  dearth  of  effective  defenses  against  them,  as  evidenced  by  their  continued  per¬ 
vasiveness.  This  project  aims  to  combat  stealthy  botnets  using  a  novel  framework  we  call 
IAPD,  which  stands  for  “Integrated  Adaptive  and  Proactive  Defenses.”  By  “proactive,”  we 
mean  that  the  defenses  can  deal  with  both  current  and  future  botnets  and  that  they  do  not 
solely  rely  on  signature-based  countermeasures  (otherwise,  the  defenders  will  always  lag  be¬ 
hind  the  attackers).  By  “adaptive,”  we  mean  that  the  defenses  can  deal  with  a  spectrum  of 
botnets,  from  less  stealthy  to  extremely  stealthy,  in  an  automated  fashion.  By  “integrated,” 
we  mean  that  the  adaptive  and  proactive  defenses  are  seamlessly  integrated  together,  while 
being  able  to  utilize  information  provided  by  other  defensive  systems.  We  believe  that  the 
framework  represents  a  necessary  step  towards  a  holistic  solution. 
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Figure  1:  A  systematic  architecture  for  defending  against  bots,  botnets  and  malware  (in¬ 
cluding  Advanced  Persistent  Threats) 

To  achieve  the  goal,  we  take  a  systems-and-theory  methodology,  meaning  that  on  one 
hand,  we  want  to  build  systems  that  can  deal  with  stealthy  attacks,  and  on  the  other  hand,  we 
want  to  build  a  theoretical  and  foundational  understanding  of  (defenses  against)  botnets  in 
particular  and  malware  attacks  in  general.  Such  theoretical  understanding  will  pave  the  way 
for  achieving  principled  modeling,  management,  and  decision-making  in  cyber  defense.  For 
systems  research,  we  have  built  real-life  stealthy  bot/malware  analysis  and  detection  systems 
with  the  architecture  shown  in  Figure  1,  which  includes  application-oriented,  infrastructure- 
oriented,  and  cross-layer  detections  that  are  seamlessly  integrated  together.  We  will  elaborate 
our  research  activities  and  results  from  these  perspectives.  For  theoretical/foundational 
research,  we  have  built  mathematical  models  for  understanding  and  reasoning  the  attack- 
defense  dynamics  in  cyberspace.  As  we  will  see,  we  have  obtained  some  exciting  results,  which 
shed  light  on  future  research  directions  that  can  lead  to  revolutionary  results  in  cybersecurity. 
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1.1  Highlight  of  Findings 

Here  we  highlight  some  of  the  findings  as  follows. 

•  There  are  many  bots/malwares  that  can  evade  the  COTS  anti-malware  detection  tools. 
Next  generation  cyber  defense  technology  will  have  to  be  simultaneously 
behavior-based ,  adaptive  and  proactive.  We  have  made  substantial  progress  on 
these  fronts  [7,  2,  8,  9,  11,  10,  12,  16,  15].  Our  future  studies  aim  to  systematically  char¬ 
acterize  the  distinguishing  features  of  malwares  based  on  our  large  malware  database. 
This  will  guide  us  to  design  the  next  generation  malware/bot  defense  technology. 

•  In  order  to  quantify  the  (in) effectiveness  of  cyber  defense,  we  need  a  theoretical  foun¬ 
dation.  Our  studies  showed  that  for  this  purpose,  multiple  disciplines  can  serve  as 
useful  modeling  tools,  including  Stochastic  Processes  [17,  22,  18],  Dynamical  Sys¬ 
tems  [23,  25,  22,  21],  Control  Theory  [24],  Statistical  Physics  [13,  14,  20]  and  Statistics 
[26].  We  will  further  our  studies  toward  a  unified  foundation.  Our  envisioned  founda¬ 
tion  will  become  on  indispensable  cornerstone  of  the  emerging  Science  of  Security.  In 
particular,  we  found  that  existing  mathematical/statistical/physical  theories  are  not 
sufficient  for  cybersecurity;  instead,  we  need  to  develop  new  theories  that  are  guided 
by,  and  centered  on,  cybersecurity  problems  (i.e.,  “mathematical  theories  with  cyber¬ 
security  meanings  /  significance” ) . 

•  We  made  several  conceptual  advancements.  For  example,  there  are  two  classes  of 
malware-based  attacks:  push-based  and  pull-based  [15,  16,  23].  Systematic  character¬ 
ization  of  these  classes  of  attacks  will  deepen  our  understanding  of  cybersecurity.  For 
this  purpose,  we  introduced  some  core  abstractions  for  understanding  cybersecurity,  in¬ 
cluding  stochastic  cyber  attack  processes  [26],  stochastic  cyber  attack-defense  processes 
[17,  18],  cyber  attack-defense  dynamics  [23,  25,  22,  21],  Finally,  cyber  attacks  and  cy¬ 
ber  attack-defense  interactions  exhibit  rich  phenomena  [26,  22,  21],  This  highlights 
the  importance  of  studying  the  phenomenon-perspective  of  cybersecurity 
(or  cybersecurity  phenomena),  which  was  not  recognized  until  now. 

•  Our  study  touched  several  deepest  concepts  in  cybersecurity.  For  example,  we  showed 
for  the  first  time  that  cyber  attack  traffic  exhibits  the  so-called  Long-Range  Dependence 
[26],  which  was  previously  known  to  be  exhibited  by  benign  traffic.  This  hints  that 
it  is  perhaps  impossible  to  detect  (stealthy)  attacks  based  on  traffic  alone. 
This  sheds  lights  on  the  detectability  of  cyber  attacks.  On  the  other  hand,  we 
showed  that  Long-Range  Dependence  can  be  exploited  to  predict  cyber  attacks  at  a 
reasonable  accuracy  [26].  This  sheds  light  on  the  predictability  of  cyber  attacks. 
It  would  be  of  fundamental  value  to  characterize  the  interaction  (or  trade-off)  between 
detectability  and  predictability  of  cyber  attacks. 

The  above  suggests  exciting  research  directions  that  will  lead  to  revolutionary  results. 

Report  outline.  Section  2  reports  our  research  results.  Section  3  describes  our  Technology 
Transfer  activities.  In  Section  4  discusses  the  DoD  relevance  of  the  research  results.  Ap¬ 
pendix  A  is  the  copies  of  some  of  the  papers  that  acknowledge  the  research  grant.  Appendix 
B  is  the  reports  of  the  subcontracts  of  the  supplementary  equipment  grant. 
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2  Research  Results 

In  this  Section  we  describe  our  research  results,  which  are  classified  into  two  categories: 
systems  research  and  theoretical/foundational  research.  The  two  categories  of  studies  are 
complementary  to  each  other. 

2.1  Systems  Research 

We  report  our  results  on  characterizing  the  incapabilities  of  the  current  generation  of  COTS 
malware/bot  defense  technologies.  Our  main  results  are  some  novel  approaches  to  the 
next  generation  malware/bot  defense  technologies,  including  (1)  mechanism-level  counter¬ 
measures  against  stealthy  malwares/bots  from  an  application-oriented,  host-oriented,  and 
infrastructure-oriented  perspectives,  and  (2)  architecture-level  designs  for  incorporating  the 
novel  mechanisms  into  a  seamlessly  integrated  framework.  We  also  discuss  our  prototype 
system  for  behavior-centric  detection  of  stealthy  malware.  We  are  in  the  process  of  sub¬ 
stantially  enhancing  this  prototype  system,  which  will  have  a  great  potential  for  commercial 
success. 

2.1.1  Limitations  of  State-of-the-Art  COTS  Malware/Bot  Defenses  Technology 

Evaluating  Detection  and  Treatment  Effectiveness  of  COTS  Anti-Malware  Pro¬ 
grams.  Commercial  anti-malware  programs  consist  of  two  main  components:  detection 
and  treatment.  Detection  accuracy  is  often  used  to  rank  effectiveness  of  commercial  anti- 
malware  programs  with  less  emphasis  on  the  equally  important  treatment  component.  Ef¬ 
fectiveness  measures  of  commercial  anti-malware  programs  should  consider  equally  detection 
and  treatment.  This  can  be  achieved  by  standardized  measurements  of  both  components. 
This  paper  [11]  presents  a  novel  approach  to  evaluate  the  effectiveness  of  a  commercial 
anti-malware  program’s  detection  and  treatment  components  against  malicious  objects  by 
partitioning  true  positives  to  incorporate  detection  and  treatment.  This  new  measurement 
is  used  to  evaluate  the  effectiveness  of  four  commercial  anti-malware  programs  in  three  tests. 
The  results  show  that  several  anti-malware  programs  produced  numerous  incorrectly  treated 
or  untreated  true  positives  and  false  negatives  leaving  many  infected  objects  unresolved  and 
thereby  active  threats  in  the  system.  These  results  further  demonstrate  that  our  approach 
evaluates  the  detection  and  treatment  components  of  commercial  anti-malware  programs  in 
a  more  effective  and  realistic  manner  than  currently  accepted  measurements  which  primarily 
focus  on  detection  accuracy. 

Evaluating  the  Limitations  of  Commercial  Anti-Malware  Tools.  Commercial  anti- 
malware  programs  (CAmps)  have  become  mainstream  security  products  and  are  widely 
deployed.  In  practice,  perhaps  due  to  economic  factors,  most  users  only  purchase  and  deploy 
a  single  anti-malware  program.  Since  it  is  known  and  proven  that  there  is  no  universally 
effective  anti-malware  program  (which  can  determine  whether  a  given  software  program 
is  a  malware),  this  effectively  bases  malware  defense  on  the  implicit  assumption  that  at 
least  the  popular  anti-malware  programs  can  provide  sufficient  security.  This  assumption, 
surprisingly,  has  been  neither  justified  nor  examined  in  a  systematic  fashion.  Motivated 
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by  our  observation  obtained  in  our  last-year  study  that  our  anomaly-based  detection  can 
recognize  malwares/bots  that  were  not  detected  by  commercial  anti-malware  detection  tools, 
in  [12]  we  propose  a  methodological  framework  for  examining  this  assumption.  We  define  a 
CAmp  to  be  competent  when  it  detects  and  cleans  all  malware  present  on  a  system.  Our 
initial  experiments  demonstrate  that  a  single  CAmp  may  not  suffice.  It  is  also  not  clear  yet 
how  many  CAmps  are  needed  to  achieve  competence  in  the  majority  of  malware  scenarios, 
though  we  attempt  to  forecast  a  suitable  amount  based  on  our  experimental  results.  Our 
ongoing  study  aims  to  resolve  this  issue  to  a  satisfactory  level. 

2.1.2  Novel  Application-Oriented  Malware/Bot  Defense  Mechanisms 

Defense  against  Social  Network-based  Stealthy  Botnets  Command-and-Control. 

This  is  an  important  aspect  that  has  not  been  paid  due  attention.  Very  recently,  botmasters 
have  begun  to  exploit  social  network  websites  (e.g.,  Twitter.com)  as  their  C&C  infras¬ 
tructures,  which  turns  out  to  be  quite  stealthy  because  it  is  hard  to  distinguish  the  C&C 
activities  from  the  normal  social  networking  traffic.  This  is  caused  by  that  attackers  can 
“legitimately”  abuse  applications  that  require  little  or  no  authorization,  which  means  that 
the  resulting  malicious  activities  are  perfectly  camouflaged  into  the  traffic  of  honest  users.  In 
[2] ,  we  studied  the  problem  of  using  social  networks  as  botnet  C&C  infrastructures.  Treating 
as  a  starting  point  the  current  generation  of  social  network-based  botnet  C&C,  we  envision 
the  evolution  of  such  C&C  methods  (which  may  become  real  in  the  near  future)  and  explore 
social  networks-based  countermeasures.  Specifically: 

•  We  characterize  the  current-generation  of  social  network-based  botnet  C&Cs,  describ¬ 
ing  their  strengths  and  weaknesses. 

•  We  envision  how  current  social  network-based  botnet  C&Cs  might  evolve  in  the  near 
future,  which  capitalize  on  their  strengths  while  diminishing  their  weaknesses. 

•  We  explore  countermeasures  for  dealing  with  both  current  and  future  generations  of 
social  network-based  botnet  C&Cs.  Since  social  network  providers  as  well  as  client 
machines  are  victims  of  a  social  network-based  botnet  C&C,  both  server-side  and 
client-side  countermeasures  are  demonstrated  and  tested  for  both  effectiveness  and 
performance. 

2.1.3  Novel  Host-Oriented  Cross-Layer  Malware/Bot  Defense  Mechanisms 

Unlike  traditional  approaches,  we  want  to  achieve  (1)  behavior-based  rather  than  signature- 
based  detection  of  malwares/bots,  and  (2)  high  resolution  rather  than  low  resolution  detec¬ 
tion  of  malwares/bots,  namely  that  we  want  to  detect  the  malicious  processes  running  on  a 
machine  rather  than  just  pointing  out  which  machines  are  compromised. 

Differentiating  Malicious  Programs  and  Benign  Programs  Through  the  Process 
Lens.  In  [9],  we  presented  three  sets  of  process-based  symptoms  drawn  from  known  mal¬ 
ware  and  bot  samples  —  bot  network  activity  behavior,  unreliable  provenance  and  stealth 
mechanisms  —  that  are  integrated  together  to  detect  bot  processes  on  a  host  machine. 
Specifically,  we  accomplish  the  following: 
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•  The  process-based  identification  of  (1)  Bot  network  activity  behavior:  failed  TCP 
connection  attempts,  DNS  and  reverse  DNS  queries;  (2)  Process  provenance:  using 
static  file  image  digital  signature  verification  and  process/file  system  tampering;  (3) 
Stealth  mechanisms:  using  the  absence  of  a  graphical  user  interface  and  no  required 
user  input  to  execute. 

•  A  formal  detection  model  based  on  a  non-trivial  use  of  established  data  mining  algo¬ 
rithms.  We  conducted  a  thorough  experiment  on  generating  and  evaluating  detection 
models.  Results  show  our  methodology  leads  to  better  detection  accuracy  for  both 
centralized  and  Peer-to-Peer  (P2P)  bots  than  a  straightforward  use  of  established  data 
mining  algorithms. 

Constructing  Malware  Infection  Trees.  Disinfecting  unknown  malware  from  a  system 
is  a  challenging  task.  A  malware  infection  tree  (MiT)  can  assist  a  user  in  disinfecting  by 
terminating  processes  and  deleting  files  related  to  the  malware.  In  [10],  we  propose  an 
approach  to  constructing  a  comprehensive  MiT  based  on  execution  event  rules.  These  rules 
abstractly  define  malware  infection  strategies  in  files  and  processes.  We  implement  the  rules 
in  a  Windows  OS  tool  named  MiTCoN  which  was  used  to  remove  malware  files  and  processes 
resulting  in  a  significant  malware  eradication  from  the  system.  We  conclude  our  rules  are 
strong  enough  to  cover  the  most  important  infection  strategies  and  usable  in  helping  disinfect 
a  system  compromised  by  both  known  and  unknown  malware. 

Cross-Layer  Detection  of  Malicious  Websites.  Web  threats  pose  the  most  significant 
cyber  threat.  Websites  have  been  developed  or  manipulated  by  attackers  for  use  as  attack 
tools.  Existing  malicious  website  detection  techniques  can  be  classified  into  the  categories 
of  static  and  dynamic  detection  approaches,  which  respectively  aim  to  detect  malicious 
websites  by  analyzing  web  contents,  and  analyzing  run-time  behaviors  using  honeypots. 
However,  existing  malicious  website  detection  approaches  have  technical  and  computational 
limitations  to  detect  sophisticated  attacks  and  analyze  massive  collected  data.  The  main 
objective  of  this  research  [15]  is  to  minimize  the  limitations  of  malicious  website  detection. 
This  paper  presents  a  novel  cross-layer  malicious  website  detection  approach  which  analyzes 
network-layer  traffic  and  application-layer  website  contents  simultaneously.  Detailed  data 
collection  and  performance  evaluation  methods  are  also  presented.  Evaluation  based  on  data 
collected  during  37  days  shows  that  the  computing  time  of  the  cross-layer  detection  is  50 
times  faster  than  the  dynamic  approach  while  detection  can  be  almost  as  effective  as  the 
dynamic  approach.  Experimental  results  indicate  that  the  cross-layer  detection  outperforms 
existing  malicious  website  detection  techniques. 

2.1.4  Novel  Infrastructure-Oriented  Malware/Bot  Defense  Mechanisms 

Differentiating  Malicious  Programs  and  Benign  Programs  Through  the  DNS 
Lens.  In  [7],  we  investigated  a  bot  process’s  DNS  activities,  including  DNS  and  Re¬ 
verse  DNS  (rDNS)  queries.  More  specifically,  we  focused  on  the  process’s  reaction-to-DNS- 
response  behavior  (RD  behavior)  occurring  in  the  initial  join  phase  during  the  early  stages 
of  a  bot  process’s  life  cycle.  Our  detection  approach  of  RD  behavior  can  be  implemented 
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at  any  point  in  a  bot’s  life  cycle;  we  choose  to  focus  on  the  early  stage  because  we  want  to 
prevent  damage  and  malware  distribution  in  the  host  machine  and  network.  This  is  possi¬ 
ble  because  during  the  initial  join  phase,  bots  may  frequently  use  DNS  activity  to  assist  in 
locating  their  Command  &  Control  (C&C)  servers  or  other  peers.  We  organized  different 
paths  of  RD  behavior  that  can  occur  in  the  join  phase  as  a  directed  tree,  classifying  expected 
versus  anomalous,  and  thereby  suspicious,  RD  behavior.  We  analyzed  five  currently  active 
centralized  and  P2P  bots,  benign  network  applications  and  non-bot  malware.  During  anal¬ 
ysis,  we  identified  suspicious  RD  behavior.  We  compared  our  analysis  of  two  commercial 
bot  detectors  and  combined  the  results  to  improve  detection  accuracy.  In  summary,  the 
contributions  of  this  research  are: 

•  Identify  a  novel  vector  of  suspicious  process  behavior  based  on  the  process’s  reaction- 
to-DNS-response  (RD  behavior).  We  further  represent  suspicious  behavior  via  paths 
on  a  directed  tree  of  DNS  activity  combined  with  RD  behavior. 

•  Enhance  host-based  detection  methods  with  a  new  vector  of  detection,  namely  suspi¬ 
cious  RD  behavior.  We  target  bot  processes  rather  than  just  bot  machines. 

•  We  show  that  this  suspicious  behavior  often  occurs  in  the  early  stages  of  bot  execu¬ 
tion,  thus  detection  at  this  point  in  time  can  prevent  the  bot  from  executing  received 
commands. 

Differentiating  Malicious  Programs  and  Benign  Programs  Through  the  Network 
Lens.  In  [8],  we  systematically  address  the  following  questions:  From  a  networking  per¬ 
spective,  do  malicious  programs  (malware,  bots,  viruses,  etc.)  behave  differently  from  benign 
programs  that  run  daily  for  various  needs?  If  so,  how  may  we  exploit  the  difference  in  net¬ 
work  behavior  to  detect  them?  To  address  these  questions,  we  are  systematically  analyzing 
the  behavior  of  a  large  set  (at  the  magnitude  of  2,000)  of  malware  samples.  This  research 
analyzes  known  malicious  and  benign  samples  in  an  attempt  to  exploit  differences  in  their 
network  behavior  to  accomplish  accurate  behavior  based  malware  detection. We  present  our 
initial  results  after  analyzing  1000  malware  samples.  The  results  show  that  malicious  and 
benign  programs  behave  quite  differently  from  a  network  perspective.  We  are  still  in  the 
process  of  understanding  the  differences,  which  nevertheless  have  been  utilized  to  detect 
31  malware  samples  which  evaded  detection  of  all  antivirus  software  on  Virus- 
total. com  on  01  April  2010,  giving  evidence  that  the  differences  between  malicious  and 
benign  network  behavior  has  a  possible  use  in  helping  stop  zero-day  attacks. 

2.1.5  Novel  Next  Generation  Malware/Bot  Defense  Architecture  and  System 

Integrated  Adaptive  and  Proactive  Defense.  Cyber  threats  against  clouds  are  dy¬ 
namic  and  evolve  rapidly.  This  means  that  traditional  reactive  defense  is  far  from  sufficient. 
In  the  present  chapter  [6],  we  advocate  that  proactive  defense  should  be  widely  deployed 
because  it  offers  the  defender  situation  awareness  and  early  warning  capabilities  in  com¬ 
bating  against  the  dynamic  cyber  threats.  Specifically,  we  describe  a  security  architecture 
for  proactively  hunting  for  new  and  possibly  zero-day  attacks.  The  architecture  consists  of 
a  set  of  seamlessly  integrated  systematic  security  mechanisms  at  the  application  layer,  the 


network  layer,  and  the  system  layer.  Features  of  the  architecture  include:  (i)  it  is  centered  on 
proactive  defense;  (ii)  it  facilitates  that  a  new  attack  detected  at  one  layer  may  lead  to  coun¬ 
termeasures  that  can  be  deployed  at  the  other  layers;  (iii)  it  uses  command  and  control  (C2) 
centers  to  coordinate  both  in-cloud  and  cross-cloud  defense  activities.  Integrating  adaptive 
and  proactive  defense  requires  the  support  of  trustworthy  sensing  or  situation  awareness  [19]. 

Proactive,  Cross-Layer,  Resilient  Detection  of  Malicious  Websites.  Malicious 
websites  are  an  important  infrastructure  of  Internet  criminal  activities.  The  resulting  attacks 
are  often  stealthy  and  more  difficult  to  detect.  There  has  been  broad  interest  in  developing 
systems  to  prevent  end  users  from  visiting  such  sites.  In  [16],  we  investigate  the  rapid  detec¬ 
tion  of  malicious  websites  by,  among  other  things,  tracking  URL  redirections  and  conducting 
content-based  analysis.  Specifically,  we  examine  websites  collected  by  our  client  honeypot 
and  other  data  sources.  We  demonstrate  the  usefulness  of  cross  application-network  layer 
security  analysis.  We  characterize  conditions  under  which  our  detection  techniques  can  or 
cannot  be  evaded.  An  automated  software  tool  for  this  purpose  is  under  development. 

Fighting  Through:  Tolerating  Penetration  of  Stealthy  Malwares.  In  [1],  we  pro¬ 
posed  and  investigated  a  novel  solution  to  tolerating  the  penetration  of  stealthy  malware 
into  computer  systems,  with  an  emphasis  on  the  protection  of  cryptographic  trustworthiness 
that  cannot  be  resolved  by  (for  example)  smart  cards.  Digital  signatures  are  an  important 
mechanism  for  ensuring  data  trustworthiness  via  source  authenticity,  integrity,  and  source 
nonrepudiation.  However,  their  trustworthiness  guarantee  can  be  subverted  in  the  real  world 
by  sophisticated  attacks,  which  can  obtain  cryptographically  legitimate  digital  signatures 
without  actually  compromising  the  private  signing  key.  This  problem  cannot  be  adequately 
addressed  by  a  purely  cryptographic  approach,  by  the  revocation  mechanism  of  Public  Key 
Infrastructure  (PKI)  because  it  may  take  a  long  time  to  detect  the  compromise,  or  by  using 
tamper-resistant  hardware  because  the  attacker  does  not  need  to  compromise  the  hardware. 
This  problem  will  become  increasingly  more  important  and  evident  because  of  stealthy  mal¬ 
ware  (or  Advanced  Persistent  Threats).  In  this  paper,  we  propose  a  novel  solution,  dubbed 
Assured  Digital  Signing  (ADS),  to  enhancing  the  data  trustworthiness  vouched  by  digital 
signatures.  In  order  to  minimize  the  modifications  to  the  Trusted  Computing  Base  (TCB), 
ADS  simultaneously  takes  advantage  of  trusted  computing  and  virtualization  technologies. 
Specifically,  ADS  allows  a  signature  verifier  to  examine  not  only  a  signatures  cryptographic 
validity  but  also  its  system  security  validity  that  the  private  signing  key  and  the  signing 
function  are  secure,  despite  the  powerful  attack  that  the  signing  application  program  and 
the  general-purpose  Operating  System  (OS)  kernel  are  malicious.  The  modular  design  of 
ADS  makes  it  application-transparent  (i.e.,  no  need  to  modify  the  application  source  code  in 
order  to  deploy  it)  and  almost  hypervisor-independent  (i.e.,  it  can  be  implemented  with  any 
Type  I  hypervisor).  To  demonstrate  the  feasibility  of  ADS,  we  report  the  implementation 
and  analysis  of  an  Xen-based  ADS  system. 

A  Prototype  System  of  Next  Generation  Behavior-Centric  Malware/Bot  Defense 
System.  In  order  to  detect  stealthy  and  new  (zero-day)  malware,  we  need  anomaly-based 
detection  systems.  The  proposed  project  will  develop  a  novel  anomaly-based  detection  sys- 
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tem.  The  system  is  based  on  malware-like  symptoms  and  can  detect  new  malware  based 
on  their  execution  behavior,  while  incurring  low  (acceptable)  false-positives.  The  symptoms 
are  extracted  from  our  systematic  study  of  a  large  set  of  malware  samples,  at  the  magnitude 
of  105  —  106.  This  is  made  possible  by  our  automated  malware  behavior  analysis  engine  on 
the  Microsoft  Windows  platform,  which  is  currently  called  Online  Malware  Analysis  System 
(OMAS,  http://omas.ics.utsa.edu/)  and  released  for  public  testing.  Compared  with 
other  host-based  behavior-based  malware  analysis  systems,  OMAS  is  unique  in  that  it  can 
detect,  via  our  own  behavior-based  malware  detection  algorithms,  whether  the  program  in 
question  is  malicious  or  not.  In  other  words,  OMAS  can  not  only  answer  the  question  “what 
a  program  in  question  does  to  a  system”  (which  is  also  answered  by  other  comparable  anal¬ 
ysis  systems),  but  also  answer  the  question  “whether  the  program  in  question  is  malicious” 
(which  is  not  offered  by  other  comparable  analysis  systems).  The  current  version  of  OMAS 
has  the  following  symptoms: 

•  Self-reference  replication  (srr)  is  a  fundamental  characteristic  of  computer  viruses  and 
worms.  This  is  used  as  a  symptom  which  our  tests  indicate  is  detectable  in  several 
malware  samples.  These  samples  are  broad  and  diverse.  We  define  srr  to  be  a  pro¬ 
cess  which  self-replicates  into  a  newly  created  or  pre-existing  file.  In  our  testing,  srr 
has  proven  a  powerful  early  indicator  of  possible  malicious  behavior.  Dynamic  code 
injection  (DCI)  is  an  anomalous  behavior  extensively  used  by  malware  to  manipulate 
running  processes  into  performing  nefarious  actions.  We  define  DCI  as  a  process  writ¬ 
ing  code  commands  into  the  allocated  memory  of  some  other  already  running  process 
and  then  creating  a  new  instance  of  that  process  which  executes  the  just  written  ne¬ 
farious  code.  This  symptom  has  proven  highly  useful  in  early  malware  detection  as  the 
vast  majority  of  our  tested  malware  samples  have  multiple  identifiable  DCI  instances. 

•  We  have  created  two  more  symptoms  by  combining  DNS  and  TCP  events  which, 
in  our  testing  have  been  identified  in  a  good  number  of  known  malware  samples.  We 
assume  a  successful  DNS  query  returns  an  IP  address  that  can  be  successfully  connected 
to.  We  also  assume  the  input  IP  address  of  a  successful  Reverse  DNS  (rDNS)  query 
can  also  be  successfully  connected  to.  Based  on  this  we  define  as  anomalous,  and 
therefore  suspicious  behavior,  a  failed  TCP  connection  attempt  to  the  input  IP  of 
successful  rDNS  or  to  the  returned  IP  of  a  successful  DNS  query.  Our  testing  has 
found  multiple  instances  of  these  two  symptoms.  We  have  established  that  when  these 
symptoms  appear  in  malware,  it  is  a  result  of  a  remote  host  having  been  shut  down 
or  temporarily  removed.  We  have  observed  malware  perform  many  DNS  events  where 
some  are  identified  under  these  two  symptoms  and  others  follow  our  notion  of  normal, 
non- anomalous,  behavior. 

•  Digital  signatures,  though  not  a  behavior,  are  a  file  attribute  that  we  have  blended 
into  our  symptoms  to  assess  suspicion.  We  believe  that  objects  with  verifiable  digital 
signatures  are  less  suspicious  than  those  without.  In  our  testing  the  vast  majority 
of  malware  samples  either  completely  lack  a  digital  signature  or  have  an  unverifiablc 
signature.  We  find  this  attribute  to  be  very  useful  in  malware  identification. 

We  are  in  the  process  of  enhancing  the  current  design  and  implementation  of  the  prototype, 
with  an  emphasis  on  enhancing  the  malware/bot  behavior-database,  which  will  replace  the 
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current  COTS  solutions  that  are  based  on  malware/bot  signatures. 


2.2  Theoretical  and  Foundational  Research 

Systems  research  aims  to  enhance  defense  and  security.  In  order  to  evaluate  the  situation 
and  the  effectiveness  of  defenses,  we  need  to  have  a  principled  foundation.  Towards  this  goal, 
we  have  made  the  following  substantial  progresses. 

2.2.1  Characterizing  Botnet  Resilience 

Modeling  and  Understanding  Botnet.  In  [4],  we  presented  a  model  for  understanding 
botnets.  The  model  captures  botnet  lifetime  cycles,  botnet  architecture,  botnet  command- 
and-control  (C&C)  mechanisms,  and  a  set  of  botnet  attributes.  In  particular,  the  model 
offers  insights  into  how  some  advanced  techniques  —  such  as  anonymous  communication 
channels,  secret  handshakes,  and  gossiping,  which  might  not  have  been  (widely)  implemented 
in  existing  botnets  —  could  be  adopted  in  future  botnets  to  make  them  more  difficult  to  deal 
with.  The  model  suggests  a  dynamic  graph  abstraction  to  abstract  botnets,  and  the  set  of 
botnet  attributes  can  be  used  to  measure  and  compare  botnets.  The  attributes  proposed  are: 
robustness,  resilience,  sustainability,  exposedness,  bandwidth  consumption,  botnet  size,  botnet 
master  goals,  and  botnet  firepower.  The  attributes  can  be  combined  to  capture  properties; 
for  example,  the  often  mentioned  term  “stealth”  can  be  reflected  by  what  we  call  robustness, 
resilience,  sustainability,  exposedness,  and  bandwidth  consumption. 

Graph-Theoretic  Model  of  Botnet  C&C.  In  [3],  we  proposed  a  graph-based  model 
for  botnet  C&C  mechanisms.  The  model  captures  an  important  aspect  of  C&C  mechanisms 
-  who  knows  whom  and  who  can  send  or  forward  C&C  messages  to  whom.  The  model 
also  accommodates  that  the  botnet  topology  itself  might  have  already  “embedded”  some 
craftiness  of  the  botnet  master  (i.e.,  botnet  topologies  are  carefully  chosen  by  the  masters). 
Moreover,  the  model  can  accommodate  the  master’s  attack  power  or  sophistication  as  well  as 
the  defender’s  detection  capability.  The  model  considers  two  botnet  stealth  measures  called 
detection  ratio  and  resilience.  The  former  aims  to  capture  the  detection  of  bots  due  to  the 
conducting  of  C&C  activities,  and  the  latter  aims  to  capture  the  tracing  of  bots  based  on 
botnet  topology.  Simulation  study  allows  us  to  draw  useful  insights  on  factors  contributing 
to  botnet  stealth,  such  as  the  following.  First,  in  order  to  achieve  the  same  degree  of 
security,  countering  a  more  sophisticated  attack  requires  a  corresponding  improvement  in 
the  defender’s  detection  capability.  Second,  in-  and  out-degree  regular  graphs,  in  which  each 
vertex  has  the  same  in-degree  as  well  as  out-degree,  as  botnet  topology  exhibit  best  observed 
stealth.  In  particular,  such  graphs  exhibit  an  “all  or  nothing”  detection  effect,  meaning  (1) 
either  all  or  no  bots  are  detected  and  (2)  the  defender  cannot  benefit  from  tracing  bots 
according  to  the  botnet  topology.  Moreover,  a  botnet  master  who  is  able  to  maintain  such  a 
botnet  topology  does  not,  in  contrast  to  a  recent  rule  of  thumb,  necessarily  gain  stealth  by 
splitting  a  large  botnet  into  smaller  ones. 

What  Can  Statistical  Physics  Contribute  to  Cybersecurity?  Site  percolation  has 
been  used  to  help  understand  analytically  the  robustness  of  complex  networks  in  the  presence 
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of  random  node  deletion  (or  failure).  In  [13],  we  move  a  further  step  beyond  random  node 
deletion  by  considering  that  a  node  can  be  deleted  because  it  is  chosen  or  because  it  is  within 
some  L-hop  distance  of  a  chosen  node.  Using  the  generating  functions  approach,  we  present 
analytic  results  on  the  percolation  threshold  as  well  as  the  mean  size,  and  size  distribution, 
of  nongiant  components  of  complex  networks  under  such  operations.  The  introduction  of 
parameter  L  is  both  conceptually  interesting  because  it  accommodates  a  sort  of  nonindepen¬ 
dent  node  deletion,  which  is  often  difficult  to  tackle  analytically,  and  practically  interesting 
because  it  offers  useful  insights  for  cybersecurity  (such  as  botnet  defense).  Deeper  results 
are  obtained  in  a  MS  Thesis  [14],  which  is  split  into  papers  for  publications.  In  particular, 
we  found  that  emergent  behavior  is  ubiquitous  in  cybersecurity  [20] . 

2.2.2  Understanding  Effectiveness  of  Next  Generation  Malware/Bot  Defense 
Technology 

Modeling  and  Characterizing  the  Usefulness  of  Adaptive  Control  in  Cyber  De¬ 
fense.  In  [24],  we  investigate  a  non- homogeneous  Susceptible- Infectious-Susceptible  (SIS) 
model  in  arbitrary  networks  (i.e.,  there  is  no  restriction  on  the  topology  of  the  spreading 
networks  and  the  nodes  may  have  different  defense  or  cure  capabilities).  The  model  can 
accommodate  both  semi-adaptive  defense  and  fully- adaptive  defense.  In  the  semi-adaptive 
defense  scenario,  the  input  parameters  in  the  model  are  known  and  can  vary  with  respect  to 
time  (e.g.,  according  to  some  deterministic  functions  of  time  or  according  to  some  stochastic 
process,  but  we  do  not  impose  any  practical  restrictions  on  the  types  of  functions).  For 
this  scenario,  we  present  a  set  of  sufficient  conditions,  from  general  to  specific  (but  more 
succinct),  under  which  the  virus  spreading  will  die  out.  We  note  that  such  sufficient  condi¬ 
tions  are  also  known  as  epidemic  thresholds  in  the  literature.  In  the  fully- adaptive  defense 
scenario,  some  input  parameters  are  not  known  and  thus  the  aforementioned  sufficient  con¬ 
ditions  are  not  applicable.  Nevertheless,  the  defender  might  be  able  to  observe  the  outcome 
of  virus  spreading  (i.e.,  which  nodes  are  infected  at  a  point  in  time).  For  this  scenario,  we 
present  adaptive  control  strategies  under  which  the  virus  spreading  will  die  out  or  will 
be  contained  to  a  desired  level  (which  is  important  when,  for  example,  the  price  to  kill  the 
virus  spreading  may  be  too  high).  Because  of  the  above,  our  model  supersedes  previous 
homogeneous  and  non-homogeneous  models  that  offered  relevant  analytical  insights.  Our 
analytical  results  are  confirmed  via  simulation,  from  which  we  draw  additional  observations 
that  serve  as  hints  for  future  modeling  studies.  We  discuss  the  practical  implications  of  our 
model  and  the  derived  insights  as  well. 

Characterizing  the  Advantage  of  Active  Defense  over  Reactive  Defense.  Current 
cyber  defense  is  mainly  reactive,  and  inevitably  causes  an  asymmetry  that  is  to  the  advantage 
of  the  attacker.  One  approach  to  circumventing  the  asymmetry  is  to  exploit  active  cyber 
defense.  Although  the  concept  of  active  cyber  defense  has  been  proposed  and  debated  for 
years,  the  focus  has  been  primarily  on  policy-related  issues  rather  than  on  characterizing  its 
effectiveness.  In  [22],  we  initiate  the  theoretical  study  on  characterizing  the  effectiveness  of 
active  cyber  defense.  For  this  purpose,  we  propose  a  novel  Markov  model  to  accommodate 
the  interaction  between  cyber  attacks  and  active  defense.  Since  the  Markov  model  is  not 
tractable  because  of  its  exponential  number  of  states,  we  simplify  it  as  a  Dynamic  System 
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model,  which  leads  to  useful  analytical  results  and  insights.  We  use  simulation  to  validate 
the  analytical  results  by  showing  that  they  are  inherent  to  the  Markov  model.  In  [21],  we 
extend  our  study  to  more  general  active  defense  dynamics,  which  actually  exhibits  richer 
phenomena  such  as  Bifurcation  and  Chaos.  We  discuss  their  fundamental  implications  for 
cybersecurity  (e.g.,  measurability  and  predictability). 

Modeling  and  Characterizing  Push-  and  Pull-based  Epidemic  Spreading  in  Net¬ 
works:  Thresholds  and  Deeper  Insights.  In  [23],  we  investigate  a  general  cyber  attack- 
defense  dynamics  model  that  accommodate,  in  arbitrary  networks,  both  push-based  infection, 
namely  that  a  compromised  node  always  actively  attempts  to  attack  its  neighboring  nodes 
and  pull-based  infection,  which  accommodates  attacks  such  as  “drive-by  download”  attack 
(e.g.,  a  vulnerable  computer  getting  compromised  after  connecting  to  a  compromised  web¬ 
site).  This  paper  significantly  deepens  our  understanding  of  push-  and  pull-based  epidemic 
spreading  dynamics  in  arbitrary  networks.  Specifically,  our  main  contributions  are:  (1) 
We  present  a  general  sufficient  condition  (also  known  as  epidemic  threshold)  under  which 
the  spreading  becomes  stable.  The  threshold  supersedes  the  existing  ones  derived  in  the 
push-based  infection  models.  (2)  We  give  both  upper  and  lower  bounds  on  the  global  mean 
infection  rate;  the  bounds  are  not  always  tight  but  do  allow  us  to  draw  interesting  and  useful 
observations.  (3)  When  the  spreading  is  stable,  we  draw  the  following  deeper  insights  into: 
the  role  of  node  degree  in  governing  node  infection  rate;  conditions  under  which  the  popular 
mean  field  approach  can  be  used  in  arbitrary  networks;  estimating  the  global  mean  infection 
rate  through  localized  monitoring  of  a  small  number  of  nodes  without  knowing  the  values  of 
the  parameters  and  independent  of  the  size  of  the  network. 

Modeling  and  Characterizing  Multi-Virus  Dynamics.  Understanding  the  spreading 
dynamics  of  computer  viruses  (worms,  attacks)  is  an  important  research  problem,  and  has 
received  much  attention  from  the  communities  of  both  computer  security  and  statistical 
physics.  However,  previous  studies  have  mainly  focused  on  single-virus  spreading  dynamics. 
In  [25],  we  study  multi-virus  spreading  dynamics,  where  multiple  viruses  attempt  to  infect 
computers  while  possibly  combating  against  each  other  because,  for  example,  they  are  con¬ 
trolled  by  multiple  botmasters.  Specifically,  we  propose  and  analyze  a  general  model  (and 
its  two  special  cases)  of  multi- virus  spreading  dynamics  in  arbitrary  networks  (i.e.,  we  do 
not  make  any  restriction  on  network  topologies),  where  the  viruses  may  or  may  not  co-reside 
on  computers.  Our  model  offers  analytical  results  for  addressing  questions  such  as:  What 
are  the  sufficient  conditions  (also  known  as  epidemic  thresholds)  under  which  the  multiple 
viruses  will  die  out?  What  if  some  viruses  can  “rob”  others?  What  characteristics  does  the 
multi-virus  epidemic  dynamics  exhibit  when  the  viruses  are  (approximately)  equally  pow¬ 
erful?  The  analytical  results  make  a  fundamental  connection  between  two  types  of  factors: 
defense  capability  and  network  connectivity.  This  allows  us  to  draw  various  insights  that 
can  be  used  to  guide  security  defense. 

2.2.3  Quantitative  Cybersecurity:  Towards  the  Science  of  Cyber  Security 

Quantitative  Cybersecurity  Models  without  Making  the  Poisson  Assumption. 

Quantitative  security  analysis  of  networked  computer  systems  has  been  an  open  problem  in 
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computer  security  for  decades.  Recently,  a  promising  approach  was  proposed  in  [5],  which, 
however,  made  some  strong  assumptions  including  the  exponential  distribution  of,  and  the 
independence  among,  the  relevant  random  variables.  In  [17],  we  substantially  weaken  these 
assumptions  while  offering,  in  addition  to  the  same  types  of  analytical  results  as  in  [5], 
methods  for  obtaining  the  desired  security  quantities  in  practice.  Moreover,  we  investigate 
the  problem  from  a  higher-level  abstraction,  which  also  leads  to  both  analytical  results  and 
practical  methods  for  obtaining  the  desired  security  quantities.  These  should  represent  a 
significant  step  toward  ultimately  solving  the  problem  of  quantitative  security  analysis  of 
networked  computer  systems. 

Quantitative  Cybersecurity  Models  without  Making  the  Independence  Assump¬ 
tion.  Models  of  epidemics  over  arbitrary  complex  networks  have  a  great  potential  to  be¬ 
come  one  of  the  pillars  in  the  emerging  foundation  of  cyber  security.  However,  this  will 
not  happen  until  after  we  have  resolved  a  set  of  challenges.  A  particular  challenge  is  to 
tame  the  dependence  in  cyber  epidemic  process/dynamics  models.  Due  to  its  notorious  dif¬ 
ficulty,  essentially  all  existing  relevant  cyber  epidemic  models  assumed  away  the  dependence 
between  the  relevant  random  variables.  In  [18],  we  move  a  significant  first  step  towards 
ultimately  taming  the  dependence  challenge.  Specifically,  we  present  an  epidemic  (more 
precisely,  attack-defense)  process  model  for  cyber  security.  The  model  is  based  on  a  gener¬ 
alization  of  a  state-of-the-art  cyber  attack-defense  dynamics  model  that  was  introduced  in 
[5]  and  thoroughly  investigated  in  [23].  Our  approach  to  tackling  dependence  is  based  on 
Copulas,  but  our  main  results  are  obtained  without  assuming  any  specific  Copula  structures 
(i.e.,  they  hold  for  arbitrary  and  unknown  dependence  structures).  We  also  characterize  the 
price  of  assuming  away  the  due  dependence  in  cyber  epidemic  models. 

Rich  Phenomena  Exhibited  by  Cyber  Attacks:  Data-driven  Statistical  Charac¬ 
terization  of  Cyber  Attacks.  In  [26],  we  report  the  Long-Range  Dependence  (LRD) 
phenomenon  that  is  exhibited  by  honeypot-captured  cyber  attacks.  To  our  knowledge,  LRD 
was  not  known  to  be  relevant  in  the  cyber  security  domain  until  now,  although  it  has 
been  known  for  two  decades  that  LRD  is  exhibited  by  benign  traffic  (i.e.,  no  attacks).  We 
demonstrate  the  cyber  security  implications  of  the  LRD  phenomenon,  by  showing  how  to 
predict  the  rates  of  incoming  attacks  with  reasonable  good  precision,  especially  at  the  scale 
of  “aggregation”  the  victims  as  a  network.  This  implies  the  possibility  of  providing  real-life 
defenders  with  sufficient  early-warning  time  for  adjusting  their  defense  configurations  (e.g., 
proactively  allocating  more  resources  for  deep  packet  inspection). We  also  explore  the  cause 
of  the  LRD  phenomenon.  We  find  that  the  theory,  which  was  developed  for  explaining  the 
possible  cause  of  LRD  in  benign  traffic,  is  probably  not  applicable  to  the  LRD  observed  in 
the  cyber  security  domain.  We  further  find  that  LRD  observed  in  the  cyber  security  domain 
is  unlikely  caused  by  the  intensity  of  attacks  that  are  launched  by  individual  attackers. 
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3  Technology  Transfer  to,  and  Collaboration  with,  AFRL 

Through  AFRL  researcher  Dr.  Keesook  Han,  the  project  delivered: 

•  Tens  of  thousands  malware  samples  to  AFRL. 

•  Daily  updated  list  of  malicious  websites  to  AFRL  as  well  as  MITRE. 

•  Two  joint  provisional  patent  applications  [16]  in  collaboration  with  AFRL  researchers 
Dr.  Keesook  Han  and  Mr.  Frank  Born.  These  works  were  partly  done  when  the  PI 
(Dr.  Shouhuai  Xu)  was  visiting  AFRL/Rome  in  the  summer  of  2011  (under  AFOSR 
SFFP). 

The  supplementary  equipment  grant  included  five  subcontractors.  All  the  subcontrac¬ 
tors  have  benefited  from  the  equipment  grant,  which  will  continue  benefiting  their  research 
and  collaboration  with  AFRL  researchers.  The  reports  of  the  subcontracts  are  attached  in 
Appendix  B. 

4  DoD  Interest 

The  research  results  are  of  high  DoD  interests.  On  one  hand,  the  systems  research  results 
paved  the  way  for  designing  and  realizing  the  next  generation  malware/bot  (or  more  gener¬ 
ally,  cyber)  defense  technologies.  Such  technologies  can  be  directly  deployed  to  protect  DoD 
cyber  systems,  with  particular  emphasis  on  combating  stealthy  malwares/bots,  zero-day  at¬ 
tacks,  and  Advanced  Persistent  Threats.  On  the  other  hand,  the  theoretical/foundational 
research  made  substantial  progresses  towards  the  Science  of  Security,  which  is  the  Holy  Grail 
challenge  and  has  extremely  high  relevance  to  DoD  missions.  For  example,  such  foundational 
understanding  and  models  can  lead  to  real-time  decision-making  tools  for  cyber  operations 
and  mission  assurance. 
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Abstract 

Malicious  websites  have  become  a  major  attack  tool  of  the  adversary.  Detection  of  malicious  websites  in  real-time 
can  facilitate  early-warning  and  filtering  the  contents  from,  and  the  accesses  to,  the  malicious  websites.  There  are 
two  main  approaches  to  detecting  malicious  websites:  static  and  dynamic.  The  static  approach  is  centered  on  the 
analysis  of  website  contents,  and  thus  can  automatically  detect  malicious  websites  in  a  very  efficient  fashion  and  can 
scale  up  to  a  large  number  of  websites.  However,  this  approach  has  limited  success  in  dealing  with  sophisticated 
attacks  that  include  obfuscation.  The  dynamic  approach  is  centered  on  the  analysis  of  website  contents  via  their 
run-time  behavior,  and  thus  can  cope  with  these  sophisticated  attacks.  However,  this  approach  is  often  expensive 
and  cannot  scale  up  to  the  magnitude  of  the  number  of  websites  in  cyberspace.  In  this  paper,  we  propose  a  novel 
cross-layer  solution  that  can  inherit  the  advantages  of  the  static  approach  while  overcoming  its  drawbacks.  Our 
solution  is  centered  on  the  following:  (i)  application-layer  web  contents,  which  were  typically  analyzed  in  the  static 
approach,  may  not  provide  sufficient  information  for  detection;  (ii)  network-layer  traffic  corresponding  to  application- 
layer  communications  might  provide  extra  information  that  can  be  exploited  to  substantially  enhance  the  detection  of 
malicious  websites.  Evaluation  of  our  cross-layer  detection  is  based  on  real-life  data  that  we  collected.  In  order  to 
deal  with  attacks  that  attempt  to  evade  our  cross-layer  detection,  we  investigate  the  resilience  of  our  solution  against 
adaptive  attacks.  We  demonstrate  that  adaptive  attacks  can  easily  evade  detections  that  include  our  own,  and  propose 
algorithms  for  effectively  defending  against  adaptive  attacks. 
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1  Introduction 

Malicious  websites  have  become  a  severe  cyber  threat  because  they  can  cause  the  automatic  download  and  execution 
of  malware  in  browsers,  and  thus  compromise  vulnerable  computers  [43].  The  phenomenon  of  malicious  websites  will 
persevere  at  least  in  the  near  future  because  we  cannot  prevent  websites  from  being  compromised  or  abused.  Existing 
approaches  to  detecting  malicious  websites  can  be  classified  into  two  categories: 

•  The  static  approach  aims  to  detect  malicious  websites  by  analyzing  their  URLs  [31,  32]  or  their  contents  [51]. 
This  approach  is  very  efficient  and  thus  can  scale  up  to  deal  with  the  huge  population  of  websites  in  cyberspace. 
This  approach  however  has  trouble  coping  with  sophisticated  attacks  that  include  obfuscation  [47],  and  thus  can 
cause  high  false-negative  rate  by  classifying  malicious  websites  as  benign  ones. 

•  The  dynamic  approach  aims  to  detect  malicious  websites  by  analyzing  their  run-time  behavior  using  Client 
Honeypots  or  their  like  [50,  53,  4,  3].  Assuming  the  underlying  detection  is  competent,  this  approach  is  very 
effective.  This  approach  however  is  resource-consuming  because  it  runs  or  emulates  the  browser  and  possibly 
the  operating  system  [11].  As  a  consequence,  this  approach  cannot  scale  up  to  deal  with  the  large  number  of 
websites  in  cyberspace. 

Because  of  the  above,  it  has  been  advocated  to  use  a  front-end  light-weight  tool,  which  is  mainly  based  on  static 
analysis  and  aims  to  rapidly  detect  suspicious  websites,  and  a  back-end  more  powerful  but  much  slower  tool,  which 
conducts  a  deeper  analysis  of  the  detected  suspicious  websites. 

While  conceptually  attractive,  the  success  of  this  hybrid  approach  fundamentally  relies  on  the  assumption  that 
the  front-end  static  analysis  does  have  very  low  false-negative  rates;  otherwise,  many  malicious  websites  will  not  be 
detected  even  if  the  back-end  dynamic  analysis  tools  are  powerful.  However,  this  assumption  can  be  easily  violated 
because  of  the  following:  First,  in  real  life,  the  attacker  could  defeat  pure  static  analysis  by  exploiting  various  so¬ 
phisticated  techniques  such  as  obfuscation  and  redirection.  Second,  the  attacker  can  get  the  same  data  and  therefore 
use  the  same  machine  learning  algorithms  to  derive  the  defender’s  classifiers.  This  is  plausible  because  in  view  of 
Kerckhoffs’s  Principle  in  cryptography,  we  should  assume  that  the  defender’s  learning  algorithms  are  known  to  the 
attacker.  As  a  consequence,  the  attacker  can  always  act  one  step  ahead  of  the  defender  by  adjusting  its  activities  so  as 
to  evade  detection.  The  above  two  issues  lead  to  the  following  question:  how  can  we  achieve  the  best  of  both  static  and 
dynamic  analysis,  and  go  beyond?  The  question  is  important  and  motivates  the  investigation  presented  in  this  paper. 

Our  contributions: 

Towards  the  ultimate  goal  of  fully  addressing  the  above  two  issues,  in  this  paper  we  propose  a  novel  solution  to  the 
problem  of  detecting  malicious  websites.  At  a  high-level,  our  solution  analyzes  the  website  contents  as  well  as  the 
redirection  website  contents  in  the  fashion  of  the  static  approach,  while  taking  advantage  of  the  network-layer  traffic 
information.  More  specifically,  our  contributions  are  the  following.  First,  we  use  static  analysis  to  proactively  track 
redirections,  which  have  been  abused  by  the  attacker  to  hide  malicious  websites.  Our  study  shows  that  static  analysis 
can  be  extended  to  track  redirections  and  detect  many  malicious  websites. 

Second,  we  exploit  the  network-layer  traffic  information  to  gain  significant  extra  detection  capabilities.  For  ex¬ 
ample,  we  find  that  network-layer  J48  classifier  [45]  achieves  99.91%  detection  accuracy,  0.47%  false-negative  rate 
and  0.03%  false-positive  rate.  They  are  significantly  better  than  their  application-layer  counterpart,  which  achieves 
98.99%  detection  accuracy,  7.63%  false-negative  rate  and  0.03%  false-positive  rate.  Moreover,  using  the  PCA  (Princi¬ 
ple  Component  Analysis  [24])  feature  selection  method  that  leads  to  80  features  that  are  used  in  the  classifier  training 
process,  cross-layer  J48  classifier  achieves  99.94%  detection  accuracy,  0.06%  false-negative  rate,  and  0.05%  false¬ 
positive  rate.  With  the  feature  selection  method  called  GainRat  ioAttributeEval  [24],  cross-layer  148  classifier 
achieves  99.91%  detection  accuracy,  0.477%  false-negative  rate,  and  0.03%  false-positive  rate.  This  is  a  somewhat 
surprising  result  because  the  training  process  is  based  on  9  out  of  the  124  features  in  total.  Among  the  9  features,  5  are 
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application-layer  ones  and  4  are  network-layer  ones.  This  suggests  that  the  data  collection  system  can  be  made  much 
more  efficient  because  it  only  needs  to  collect  a  few  critical  features. 

Third,  we  present  three  adaptation  algorithms  that  may  be  used  by  the  attacker  to  launch  adaptive  attacks,  and  can 
be  exploited  by  the  defender  to  launch  adaptive  defense.  Our  algorithms  allow  us  to  demonstrate  how  easy  it  can  be 
for  an  adaptive  attacker  to  evade  non-adaptive  detection.  We  also  show  how  our  algorithms  can  effectively  deal  with 
adaptive  attacks,  and  thus  make  our  detection  system  resilient  to  adaptive  attacks.  This  is  true  even  if  a  few  features 
are  used  for  training  J48  classifiers. 

Evaluation  of  our  solution  is  based  on  real  data,  except  for  resilience  analysis  where  we  manipulate  real  data  to 
mimic  adaptive  attacks.  We  have  developed  a  detection  system  that  serves  as  the  fast  and  scalable  front-end  of  a 
comprehensive  solution,  which  is  depicted  in  Figure  1 . 


Input  URL 


Figure  1 :  The  relationship  between  our  detection  system  and  a  comprehensive  solution. 

Paper  organization: 

The  rest  of  the  paper  is  organized  as  follows.  Section  2  describes  our  cross-layer  data  collection  and  analysis.  Section 
3  investigates  the  resilience  of  our  cross-layer  detection  system.  Section  4  discusses  related  prior  work.  Section  5 
elaborates  the  limitations  of  our  system  and  suggests  future  research  directions.  Section  6  concludes  the  paper. 

Summary  of  main  notations: 

The  following  table  summarizes  the  main  notations  used  in  the  paper. 


xt 

the  ^th  feature  of  feature  vector  X 

mmz ,  max2 

min  and  max  value  of  the  zth  feature 

applying  classifier  Mi  to  dataset  Dj  in  adaptive  attack  setting 

Mo-i(Dj) 

the  majority  vote  of  Mo(Dj),  Mi(Dj),  . . .,  Mi(Dj)  in  adaptive  attack  setting 

f,9,h 

functions  the  attacker  uses  to  manipulate  its  behavior  in  adaptive  attack  setting 

2  Cross-layer  Analysis  and  Detection  of  Malicious  Websites 

In  this  section,  we  describe  the  architecture  of  our  cross-layer  data  collection  system,  and  present  our  analysis  method¬ 
ology  and  the  evaluation  results  of  our  classifiers. 

2.1  Cross-layer  Data  Collection  and  Pre-processing 

Data  collection  method  and  system  architecture: 

In  order  to  facilitate  cross-layer  analysis  and  detection,  we  built  an  automated  system  to  collect  both  the  application- 
layer  URL  contents  and  the  resulting  network-layer  traffic.  The  architecture  of  our  automated  data  collection  system 
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is  depicted  in  Figure  2.  At  a  high-level,  our  data  collection  system  is  centered  on  a  crawler,  which  takes  a  list  of  URLs 
as  input,  automatically  fetches  the  website  contents  by  launching  HTTP/HTTPS  requests  to  the  target  URLs,  and 
tracks  the  redirections  it  identified  from  the  website  contents  (elaborated  below).  The  crawler  further  uses  the  URLs, 
including  both  the  input  ones  and  the  resulting  redirection  URLs,  to  query  the  DNS,  Whois,  and  Geographic  services 
for  collecting  relevant  features  for  analysis.  The  application-layer  web  contents  and  the  corresponding  network-layer 
IP  packets  are  recorded  separately,  but  are  indexed  by  the  input  URLs  to  facilitate  cross-layer  analysis.  The  collected 
application-layer  raw  data  are  pre-processed  to  make  them  suitable  for  machine  learning  tasks  (also  elaborated  below). 


Input  URLs 


Figure  2:  Data  collection  system  architecture. 

Statically  and  proactively  tracking  redirects: 

The  data  collection  system  proactively  tracks  redirections  by  analyzing  the  website  contents  in  a  static  fashion,  which 
means  that  our  solution  is  as  fast  and  scalable  as  the  static  approach.  Specifically,  we  consider  the  following  four 
types  of  redirects.  The  first  type  is  server  side  redirects  that  are  initiated  either  by  server  rules  (i.e.,  .  htaccess  file) 
or  server  side  page  code  such  as  PHP.  These  redirects  often  utilize  HTTP  300  level  status  codes.  The  second  type  is 
JavaScript-based  redirects.  Despite  extensive  study,  there  has  been  limited  success  in  dealing  with  JavaScript-based 
redirection  that  is  coupled  with  obfuscation  [19].  The  third  type  is  the  refresh  Meta  tag  and  HTTP  refresh  header, 
which  allow  one  to  specify  the  URLs  of  the  redirection  pages.  The  fourth  type  is  embedded  file  redirects.  Exam¬ 
ples  of  this  type  are:  -ciframe  src=' badsite  .  php' />,  <img  src=' badsite  .  php' />,  and  <script 
src=' badsite . php'  >  </script>. 

It  is  important  to  understand  that  the  vast  majority  of  malicious  URLs  are  actually  victim  sites  that  have  themselves 
been  hacked.  Sophos  Corporation  has  identified  the  percentage  of  malicious  code  that  is  hosted  on  hacked  sites  as 
90%  [5].  Most  often  this  malicious  code  is  implanted  using  SQL  injection  methods  and  shows  up  in  the  form  of 
an  embedded  file  as  identified  above.  In  addition,  stolen  ftp  credentials  allow  hackers  to  have  direct  access  to  files 
where  they  can  implant  malicious  code  directly  into  the  body  of  a  web  page.  The  value  of  the  embedded  file  method 
to  hackers  is  that,  through  redirections  and  changing  out  back  end  code  and  file  references,  they  can  better  hide  the 
malicious  nature  of  these  embedded  links  from  search  engines  and  browsers. 

Description  and  pre-processing  of  application-layer  data: 

The  resulting  application-layer  data  have  105  features  in  total,  which  are  obtained  after  pre-processing  the  collected 
application-layer  raw  data.  The  application-layer  raw  data  consist  of  feature  vectors  that  correspond  to  the  respective 
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input  URLs.  Each  feature  vector  consists  of  various  features,  including  information  such  as  HTTP  header  fields,  and 
information  obtained  by  using  the  input  URLs  and  the  detected  redirection  URLs  to  query  DNS  and  Whois  services. 
In  particular,  redirection  information  includes  (i)  redirection  method,  (ii)  whether  a  redirection  points  to  a  different 
domain,  (iii)  the  number  of  redirection  hops. 

Because  different  URLs  may  involve  different  numbers  of  redirection  hops,  different  URLs  may  have  different 
numbers  of  features.  This  means  that  the  application-layer  raw  feature  vectors  do  not  necessarily  have  the  same  number 
of  features,  and  thus  cannot  be  processed  by  both  classifier  learning  algorithms  and  classifiers  themselves.  We  resolve 
this  issue  by  aggregating  multiple-hop  information  into  artificial  single-hop  information  as  follows;  for  numerical  data, 
we  aggregate  them  by  using  their  average  instead;  for  boolean  data,  we  aggregate  them  by  taking  the  OR  operation; 
for  nominal  data,  we  only  consider  the  final  destination  URL  of  the  redirection  chain.  For  example,  suppose  that  an 
input  URL  is  redirected  twice  to  reach  the  final  destination  URL  and  the  features  are  (Content-Length,  “Does 
JavaScript  function  eval  ( )  exist  in  the  code?”,  Country).  Suppose  that  the  raw  feature  vectors  corresponding  to 
the  input,  first  redirection,  and  second  redirection  URLs  are  (100,  FALSE,  US),  (200,  FALSE,  UK),  and  (300,  TRUE, 
RUSSIA),  respectively.  We  aggregate  the  three  raw  feature  vectors  as  (200,  TRUE,  RUSSIA),  which  is  stored  in  the 
application-layer  data  for  analysis. 

Description  of  network-layer  data: 

Network-layer  features  are  extracted  from  the  corresponding  PCAP  (Packet  CAPture)  files  that  are  recorded  when 
the  crawler  accesses  the  input  URLs.  There  are  19  network-layer  features  that  are  derived  from  IP  level,  UDP/TCP 
level  or  flow  level,  where  a  flow  is  uniquely  identified  by  a  tuple  (source  IP,  source  port  #,  destination  IP,  destination 
port  #,  protocol).  Some  examples  of  network-layer  features  are;  I  at  .flow,  which  is  the  cumulative  inter-arrival 
time  between  the  flows  caused  by  the  access  to  an  input  URL;  DNS_query_times,  which  is  the  total  number  of 
DNS  queries  caused  by  the  access  to  an  input  URL;  TCP.conversation.exchange,  which  is  the  number  of 
conversation  exchanges  in  the  TCP  connections  caused  by  the  crawler’s  access  to  an  input  URL;  IP_packets,  which 
is  the  number  of  IP  packets  caused  by  the  access  to  an  input  URL;  Avg_remote_rate,  which  is  the  rate  the  remote 
server  sends  to  local  host  (packets  per  second)  during  the  crawler’s  access  to  an  input  URL. 

2.2  Cross-layer  Data  Analysis  Methodology 

Classifier  accuracy  metrics: 

Suppose  that  the  defender  learned  a  classifier  M  from  some  training  data.  Suppose  that  the  defender  is  given  test 
dataset  D,  which  consists  of  <l  \  malicious  URLs  and  d2  benign  URLs.  Suppose  further  that  among  the  (1\  malicious 
URLs,  M  correctly  detected  d[  of  them,  and  that  among  the  d2  benign  URLs,  M  correctly  detected  d'2  of  them.  The 
detection  accuracy  of  M  is  defined  as  gyygA  The  false-positive  rate  is  defined  as  d2d^2 .  The  true-positive  rate  is 

defined  as  ^ .  The  false-negative  rate  is  defined  as  .  Ideally,  we  want  a  classifier  to  achieve  high  detection 

accuracy,  low  false-positive  rate  and  low  false-negative  rate. 

Data  analysis  methodology: 

Our  analysis  methodology  was  geared  towards  answering  questions  about  the  power  of  cross-layer  detection.  It  has 
two  steps,  which  are  equally  applicable  to  both  application-layer  and  network-layer  data. 

1.  Feature  selection  (optional):  Because  there  are  124  features  in  total,  we  may  need  to  conduct  feature  selection. 
We  used  the  following  three  feature  selection  methods.  The  first  method  is  “P  CA  with  the  Ranker  search  method” 
(PCA  for  short)  that  transforms  a  set  of  feature  vectors  to  a  set  of  shorter  feature  vectors  [24]. 

The  second  feature  selection  method  is  called  “Cf  sSubsetEval  with  the  Best-First  search  method”  in  the 
Weka  toolbox  (Cf  sSubsetEval  for  short)  [24].  It  essentially  computes  the  features’  prediction  power,  and 
its  selection  algorithm  essentially  ranks  the  features’  contributions  [23].  It  outputs  a  subset  of  features  that  are 
substantially  correlated  with  the  class  (benign  or  malicious)  but  have  low  inter-feature  correlations. 

The  third  feature  selection  method  is  called  “GainRatioAttributeEval  with  the  Ranker  search  method” 
(GainRatioAttributeEval  for  short)  in  the  Weka  toolbox  [24].  Its  evaluation  algorithm  essentially  com¬ 
putes  the  information  gain  ratio  (or  more  intuitively  the  importance  of  each  feature)  with  respect  to  the  class,  and 
its  selection  algorithm  ranks  features  based  on  their  information  gains  [16].  It  outputs  the  ranks  of  all  features 
in  the  order  of  decreasing  importance. 
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2.  Model  learning  and  validation:  We  used  four  popular  learning  algorithms:  Naive  Bayes,  Logistic,  SVM,  and 
J48,  which  have  been  implemented  in  the  Weka  toolbox  [24].  Naive  Bayes  classifier  is  a  probabilistic  classifier 
based  on  Bayes’  rule  [25].  Logistic  classifier  [29]  is  one  kind  of  linear  classification,  where  the  domain  of  the 
target  variable  is  {0, 1}.  SVM  (Support  Vector  Machine)  classifier  aims  to  find  an  maximum-margin  hyperplane 
for  separating  different  classes  in  the  training  data  [14].  We  use  the  SMO  (Sequential  Minimal-Optimization) 
algorithm  in  our  experiment  with  polynomial  kernel  function,  which  is  efficient  [42].  J48  classifier  is  the  Weka 
implementation  of  C4.5  decision  trees  [45],  which  can  be  used  for  binary  classification. 

For  cross-layer  data  analysis,  we  consider  the  following  two  cross-layer  aggregation  methods. 

•  Data-level  aggregation.  The  application-layer  feature  vector  and  the  network-layer  feature  vector  with  respect 
to  the  same  URL  are  simply  merged  into  a  single  longer  feature  vector.  This  is  possible  because  the  vectors 
are  indexed  by  the  input  URLs.  In  this  case,  the  data-level  aggregation  operation  is  conducted  before  the  above 
two-step  data  analysis  process. 

•  Model-level  aggregation.  The  decision  on  whether  a  website  is  malicious  is  based  on  the  decisions  of  the 
application-layer  classifier  and  the  network-layer  classifier.  There  are  two  options.  One  option  is  that  a  website 
is  classified  as  malicious  if  the  application-layer  classifier  or  the  network-layer  classifier  says  it  is  malicious; 
otherwise,  it  is  classified  as  benign.  We  call  this  OR-aggregation.  The  other  option  is  that  a  website  is  classified  as 
malicious  if  both  the  application-layer  classifier  and  the  network-layer  classifier  say  it  is  malicious;  otherwise, 
it  is  classified  as  benign.  We  call  this  AND-aggregation.  In  either  case,  both  application-  and  network-layer 
data  are  processed  using  the  above  two-step  data  analysis  process.  Then,  the  resulting  classifiers  are  further 
aggregated  using  the  OR  or  AND  operation. 

Datasets  description: 

Our  dataset  consists  of  1,467  malicious  URLs  and  10,000  benign  URLs.  The  malicious  URLs  are  part  of  the 
28,018  blacklist  URLs  downloaded  from  malware.com.br,  malwaredomainlist.com  and  compuweb. 
com/ur  1-domain-bl .  txt  on  the  same  day  as  we  conducted  the  experiment  (Nov.  3,  2011),  and  are  confirmed 
as  malicious  by  the  high-interaction  client  honeypot  Capture-HPC  version  3.0  [50].  Our  test  of  blacklist  URLs  using 
high-interaction  client  honeypot  confirmed  our  observation  that  some  or  many  currently  blacklisted  URLs  are  not  ac¬ 
cessible  any  more  and  thus  should  not  be  counted  as  malicious  URLs.  The  10,000  benign  URLs  are  obtained  from 
alexa  .  com,  which  lists  the  top  10,000  websites  that  are  supposed  to  be  well  protected. 

2.3  On  the  Power  of  Cross-Layer  Detection 

In  order  to  identify  the  more  powerful  classifiers,  we  compare  the  aforementioned  four  classifiers,  with  and  without 
feature  selections.  Table  1  describes  the  results  without  using  feature  selection,  using  the  PCA  and  Cf  s  Subset  Eval 
feature  selections.  We  make  the  following  observations.  First,  by  taking  a  horizontal  perspective,  we  observe  that 
for  cross-layer  detection  with  or  without  using  feature  selection,  J48  classifier  often  performs  better  than  the  other 
three  classifiers.  Moreover,  for  data-level  aggregation,  J48  classifier  achieves  the  highest  detection  accuracy  and  the 
lowest  false-negative  rate.  For  OR-aggregation,  J48  classifier  achieves  the  highest  detection  accuracy  and  the  lowest 
false-positive  rate.  For  AND-aggregation,  J48  classifier  naturally  leads  to  the  lowest  false-positive  rate,  but  also  causes 
a  relatively  high  false-negative  rate. 

Second,  by  taking  a  vertical  perspective,  we  observe  that  J48  classifier  based  on  data-level  aggregation  and  without 
using  feature  selection  achieves  the  highest  detection  accuracy  and  the  lowest  false-negative  rate.  Given  that  we  have 
singled  out  data-level  aggregation  and  OR-aggregation  cross-layer  J48  classifier,  let  us  now  look  at  whether  using 
feature  selection  will  jeopardize  classifier  quality.  We  observe  that  using  PCA  feature  selection  actually  leads  to 
roughly  the  same,  if  not  better,  detection  accuracy,  false-negative  rate,  and  false-positive  rate. 

In  the  case  of  data-level  aggregation,  J48  classifier  can  be  trained  using  80  features  that  are  derived  from  the  124 
features  using  PCA;  the  Cf  sSubsetEval  feature  selection  method  actually  leads  to  the  use  of  four  network-layer 
features:  (1)  Local_app_bytes,  which  is  the  accumulated  application  bytes  of  TCP  packets  sent  from  local  host 
to  the  remote  server.  (2)  Dist_remote_tcp_port,  which  is  the  cumulative  number  of  distinct  TCP  ports  that 
have  been  used  by  the  remote  server.  (3)  Iat_f  low,  which  is  the  accumulated  inter-arrival  time  between  flows. 
(4)  Avg.remote.rate,  which  is  the  rate  the  remote  server  sends  to  local  host  (packets  per  second).  This  can  be 
explained  as  follows:  malicious  websites  that  contain  malicious  JavaScript  code  or  contents  can  cause  frequent  and 
large  volume  communications  between  remote  servers  and  local  hosts.  Note  that  this  does  not  contradict  the  fact  that 
our  crawler  is  a  static  analysis  tool.  This  is  because  our  crawler  program  only  downloads  the  source  code  of  web 
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Layer 

Feature  selection 

Naive  Bayes 

Logistic 

SVM 

J48 

method 

Acc. 

FN 

FP 

Acc. 

FN 

FP 

Acc. 

FN 

FP 

Acc. 

FN 

FP 

Application-layer 

none 

98.54 

11.31 

0.01 

99.87 

0.27 

0.1 

98.92 

7.43 

0.13 

98.99 

7.63 

0.03 

PCA 

94.44 

1.43 

6.16 

99.76 

1.64 

0.04 

99.60 

2.93 

0.02 

99.88 

0.68 

0.03 

Cfs Subset Eval 

98.45 

4.23 

1.16 

99.81 

1.30 

0.03 

99.69 

2.38 

0.0 

99.80 

1.29 

0.03 

Network-layer 

none 

98.60 

1.91 

1.32 

99.90 

0.61 

0.03 

99.75 

1.90 

0.0 

99.91 

0.47 

0.03 

PCA 

78.09 

55.94 

9.39 

79.94 

58.09 

6.07 

78.44 

69.69 

3.85 

94.88 

9.32 

3.57 

Cf sSubsetEval 

77.86 

72.25 

3.71 

80.88 

56.89 

5.23 

77.56 

79.48 

1.46 

95.71 

6.77 

3.38 

Cross-layer 

none 

99.75 

1.84 

0.01 

99.79 

0.74 

0.12 

99.78 

1.70 

0.0 

99.91 

0.47 

0.03 

(data-level  agg.) 

PCA 

87.32 

24.47 

10.94 

99.61 

1.29 

0.25 

99.41 

4.49 

0.01 

99.94 

0.06 

0.05 

Cf sSubsetEval 

98.44 

4.22 

1.16 

99.80 

1.29 

0.03 

99.69 

2.38 

0.0 

99.80 

1.29 

0.03 

Cross-layer 

none 

98.65 

1.50 

1.33 

99.89 

0.00 

0.13 

99.63 

1.91 

0.14 

99.89 

0.48 

0.06 

(OR-aggregation) 

PCA 

85.82 

1.43 

16.05 

99.28 

1.64 

0.59 

98.97 

2.93 

0.75 

99.92 

0.00 

0.09 

Cf sSubsetEval 

97.97 

1.23 

2.15 

98.63 

1.30 

1.38 

97.61 

2.39 

2.39 

98.97 

1.30 

0.99 

Cross-layer 

none 

98.50 

11.72 

0.00 

99.89 

0.89 

0.00 

99.05 

7.43 

0.00 

99.02 

7.63 

0.00 

(AND-aggregation) 

PCA 

91.83 

58.55 

0.78 

98.93 

8.38 

0.00 

98.91 

8.52 

0.00 

99.81 

1.50 

0.00 

Cf sSubsetEval 

98.67 

10.43 

0.00 

99.05 

7.43 

0.00 

95.13 

38.10 

0.00 

99.05 

7.43 

0.00 

Table  1:  Comparison  (%)  between  no  feature  selection  and  two  feature  selection  methods  (Acc.:  detection  accuracy;  FN:  false¬ 
negative  rate;  FP:  false-positive  rate). 


pages,  parses  its  content  but  does  not  execute  it.  Our  crawler  is  different  from  traditional  web  browsers  that  have  a 
display  engine  and  dynamically  execute  JavaScript  code. 


(a)  Application-layer 


rr** 

Naive  Bayes 
SVM 

— 

Logistic 

J48  □ 

3  6  9  12  15 

#  of  features  used  for  learning 

(b)  Network-layer 


#  of  features  used  for  learning 
(c)  Cross-layer  (data-level  agg.) 


Figure  3:  Detection  accuracy  (%)  vs.  the  number  of  features  selected  from  training  classifiers  (in  order  of  decreasing  significance). 
When  no  features  are  used,  the  classifier  can  class  all  URLs  as  benign,  which  implies  87.21%  detection  accuracy  because  there  are 
87.21%  benign  URLs. 


2.4  On  the  Feasibility  of  Using  a  Few  Features  for  Learning  Classifiers 

How  few  features  can  we  use  to  train  classifiers?  To  answer  this  question,  we  use  the  GainRatioAttributeEval 
feature  selection  method  because  it  actually  ranks  the  contributions  of  the  individual  features.  Figure  3  plots  the  exper¬ 
imental  results,  which  allow  us  to  make  the  following  observations.  For  application-layer,  using  the  following  eleven 
features  already  leads  to  99.01%  detection  accuracy  for  J48  (98.88%,  99.82%,  99.76%  for  Naive  Bayes,  Logistic  and 
SVM,  respectively).  (1)  HTTPHeader_server,  which  is  the  type  of  the  HTTP  server  at  the  redirection  destination 
of  an  input  URL  (e.g.,  Apache,  Microsoft  IIS,  etc.).  (2)  Whois_regDate,  which  is  the  registration  date  of  the 
website  that  corresponds  to  the  redirection  destination  of  an  input  URL.  (3)  HTTPHeader.cacheControl,  which 
indicates  the  cache  management  method  in  the  server  side.  (4)  Whois_stateProv,  which  is  the  registration  state  or 
geographical  location  of  the  website.  (5)  Charset,  which  is  encoding  charset  of  current  URL  (e.g.,  iso-8859-1),  and 
hints  the  language  a  website  used  and  its  target  users’  ethnicity.  (6)  Within_domain,  which  indicates  whether  the 
destination  URL  and  the  original  URL  are  in  the  same  domain.  (7)  Updated_date,  which  is  the  last  update  date  of 
the  final  redirection  destination  website.  (8)  Content_length_val  id,  which  indicates  whether  the  content  length 
field  in  the  HTTP  header  is  valid.  This  is  relevant  because  the  content  length  field  could  be  a  negative  number,  which 
may  cause  buffer  overflow  attacks.  (9)  Redirect_num,  which  is  the  total  number  of  redirects  embedded  into  an 
input  URL  to  destination  URL.  Malicious  websites  often  have  a  larger  number  of  redirects  than  benign  websites.  (10) 
Country,  which  is  the  country  of  the  registrant.  (11)  Protocol,  which  indicates  the  transfer  protocol  a  website 
uses.  Note  that  HTTPS  is  normally  used  by  benign  websites.  When  these  1 1  features  are  used  for  training  classifiers, 
we  can  achieve  detection  accuracy  of  98.22%,  97.03%,  96.69%  and  99.01%  for  Naive  Bayes,  Logistic,  SVM  and  J48 
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classifiers  respectively. 

For  network-layer,  using  the  following  nine  features  can  have  good  detection  accuracy  and  lower  false-negative 
rate.  (1)  Avg_remote_packet_rate,  which  is  the  average  IP  packets  rate  (packets  per  second)  sent  by  the 
remote  server.  For  multiple  remote  IP  addresses,  this  feature  is  the  average  IP  packets  rate  of  the  remote  web¬ 
site.  (2)  Dist_remote_TCP_port,  which  is  the  number  of  distinct  TCP  ports  opened  by  remote  servers.  (3) 
Dist_remote_IP,  which  is  the  number  of  distinct  remote  server  IP  addresses.  (4)  DNS_answer_times,  which 
is  the  number  of  DNS  answers  sent  by  DNS  server.  (5)  Flow_number,  which  is  the  number  of  flows.  (6) 
Avg_local_packet_rate,  which  is  the  average  rate  of  IP  packets  that  are  sent  by  the  local  host  (packets  per 
second).  (7)  DNS_query_times,  which  is  the  number  of  DNS  queries  sent  by  local  host.  (8)  Duration, 
which  is  the  duration  of  time  consumed  for  a  conversation  between  the  local  host  and  the  remote  server.  (9) 
Local_app_packets,  which  is  the  number  of  IP  packets  sent  by  the  local  host  to  the  remote  server.  When  these 
nine  features  are  used  for  training  classifiers,  we  can  achieve  detection  accuracy  of  98.88%,  99.82%,  99.76%  and 
99.91%  for  Naive  Bayes,  Logistic,  SVM  and  J48  classifiers  respectively.  An  explanation  of  this  phenomenon  is  the 
following:  Because  of  redirection,  visiting  malicious  URLs  will  cause  local  host  to  send  multiple  DNS  queries  and 
connect  to  multiple  remote  servers,  which  might  lead  to  high  volume  communications. 

We  observe,  as  expected,  that  J48  classifier  performs  at  least  as  good  as  the  others  in  terms  of  network-layer 
detection  and  cross-layer  detection.  Note  that  in  this  case  we  have  to  compare  the  false-negative  and  false-positive 
rates  with  respect  to  the  number  of  features  that  are  used  for  learning  classifiers.  In  Table  2,  we  summarize  the 
false-negative  and  false-positive  rates  of  the  classifiers  learned  from  a  few  features.  The  five  application-layer  fea¬ 
tures  and  four  network-layer  features  used  in  the  data-level  aggregation  case  are  the  top  five  (out  of  the  eleven) 
GainRatioAttributeEval-selected  features  used  by  the  application-layer  classifier  and  the  top  four  (out  of 
the  nine  selected)  GainRatioAttributeEval-selected  features  used  by  the  network-layer  classifier,  respectively. 
The  eleven  application-layer  features  and  nine  network-layer  features  used  in  the  OR-aggregation  and  AND-aggregation 
are  the  same  as  the  features  that  are  used  in  the  application-layer  and  network-layer  classifiers.  We  make  the  following 
observations.  First,  J48  classifier  learned  from  fewer  application-layer  features,  network-layer  features  and  cross-layer 
features  can  still  maintain  very  close  detection  accuracy  and  false-negative  rate. 


Layer 

number  of 

Naive  Bayes 

Logistic 

SVM 

J48 

features 

Acc. 

FN 

FP 

Acc. 

FN 

FP 

Acc. 

FN 

FP 

Acc. 

FN 

FP 

Application 

11 

98.22 

7.430 

0.95 

97.04 

7.430 

2.3 

96.69 

7.430 

2.7 

98.21 

7.430 

0.96 

Network 

9 

98.79 

1.908 

1.099 

99.81 

1.226 

0.03 

99.75 

1.908 

0.0 

99.90 

0.545 

0.03 

Cross  (data-level  agg.) 

5+4 

98.88 

1.908 

1.0 

99.81 

1.295 

0.02 

99.75 

1.908 

0.0 

99.91 

0.477 

0.03 

Cross  (OR-aggregation) 

11+9 

98.06 

1.23 

2.05 

97.82 

1.23 

2.32 

97.40 

1.91 

2.70 

99.07 

0.55 

0.99 

Cross-layer  (AND-aggregation) 

11+9 

98.96 

8.11 

0.000 

99.04 

7.43 

0.010 

99.05 

7.43 

0.000 

99.05 

7.43 

0.00 

Table  2:  Effect  when  a  few  features  are  used  for  learning  classifiers  (Acc.:  detection  accuracy;  FN:  false-negative  rate;  FP:  false¬ 
positive  rate;  a  +  b:  a  application-layer  features  plus  b  network-layer  features). 

Second,  for  all  the  cross-layer  classifiers  based  on  data-level  aggregation,  five  application-layer  features  (i.e., 
HTTPHeader_server,  Whois_regDate,  HTTPHeader_cacheControl,  Within_domain,  Updated_date) 
and  four  network-layer  features  (i.e.,  Avg_remote_packet_rate,  Dist_remote_TCP_port,  Dist_remote_IP, 
DNS_answer_times)  can  already  achieve  almost  as  good  as,  if  not  better  than,  the  other  scenarios.  In  particular, 
J48  actually  achieves  99.91%  detection  accuracy,  0.477%  false-negative  rate,  and  0.03%  false-positive  rate,  which  is 
comparable  to  the  J48  classifier  learned  from  all  the  124  features,  which  leads  to  99.91%  detection  accuracy,  0.47% 
false-negative  rate,  and  0.03%  false-positive  rate  without  using  any  feature  selection  method  (see  Table  1).  This  means 
that  data-level  aggregation  with  as  few  as  nine  features  is  practical  and  highly  accurate. 

Third,  Naive  Bayes  classifier  exhibits  the  following  interesting  phenomenon:  the  detection  accuracy  actually  drops 
when  more  features  are  used  for  building  classifiers.  One  possible  cause  of  this  “ol'f-trend”  phenomenon  is  that  the 
feature  selection  method  was  not  the  Bayesian  approach  and  hence  not  optimum.  We  leave  it  to  future  work  to 
theoretically  explain  the  cause  of  this  phenomenon. 


2.5  On  the  Practicality  of  the  Cross-layer  System 

As  discussed  in  the  Introduction,  we  aim  to  make  our  system  as  fast  and  scalable  as  the  static  analysis  approach  while 
achieving  high  detection  accuracy,  low  false-negative  rate,  and  low  false-positive  rate  as  the  dynamic  approach.  In 
the  above,  we  have  demonstrated  that  our  cross-layer  system,  which  can  be  based  on  either  the  data-level  aggregation 
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or  the  OR-aggregation  and  even  using  as  few  as  nine  features  in  the  case  of  data-level  aggregation,  achieved  high 
detection  accuracy,  low  false-negative  rate,  and  low  false-positive  rate.  In  what  follows  we  confirm  that,  even  without 
using  any  type  of  optimization  and  collecting  all  the  124  features  rather  than  the  necessary  nine  features,  our  system  is 
at  least  about  25  times  faster  than  the  dynamic  approach.  To  be  fair,  we  should  note  that  we  did  not  consider  the  time 
spent  for  learning  classifiers  and  the  time  spent  for  applying  a  classifier  to  the  data  collected  from  a  given  URL.  This 
is  because  the  learning  process  only  takes  respectively  6.4,  1.15,  and  9.31  seconds  for  learning  the  J48  classifier  from 
the  application-layer,  network-layer,  and  data-level  aggregation  data,  and  because  the  process  of  applying  a  classifier 
to  a  given  data  takes  no  more  than  1  second. 

In  order  to  measure  the  performance  of  our  data  collection  system,  it  would  be  natural  to  use  the  time  spent  on 
collecting  the  cross-layer  data  information  and  the  time  spent  by  the  client  honeypot  system.  Unfortunately,  this  is  not 
feasible  because  our  data  collection  system  is  composed  of  several  computers  with  different  hardware  configurations. 
To  resolve  this  issue,  we  conducted  extra  experiments  using  two  computers  with  the  same  configuration.  One  computer 
will  run  our  data  collection  system  and  the  other  computer  will  run  client  honeypot  system.  The  hardware  of  the  two 
computers  is  Intel  Xeon  X3320  4  cores  CPU  and  8GB  memory.  We  use  Capture-HPC  client  honeypot  version  3.0.0 
and  VMWare  Server  version  1.0.6,  which  runs  on  top  of  a  Host  OS  (Windows  Server  2008)  and  hosts  a  Guest  OS 
(Windows  XP  sp3).  Since  Capture-HPC  is  high-interactive  and  thus  necessarily  heavy-weight,  we  ran  five  guest 
OS  (according  to  our  experiment,  more  guest  OS  will  make  the  system  unstable),  and  used  default  configuration  of 
Capture-HPC. 

Our  data  collection  system  uses  a  crawler,  which  was  written  in  JAVA  1.6  and  runs  on  top  of  Debian  6.0.  Besides 
the  JAVA  based  crawler,  we  also  use  IPTABLES  [2]  and  modified  version  of  TCPDUMP  [6]  to  obtain  high  parallel 
capability.  When  running  multiple  crawler  instances  at  the  same  time,  the  application  features  can  be  obtained  by  each 
crawler  instance,  but  the  network  feature  of  each  URL  should  also  be  extracted.  TCPDUMP  software  can  be  used  to 
capture  all  the  outgoing  and  incoming  network  traffic  on  local  host.  IPTABLES  can  be  configured  to  log  network  flow 
information  with  respect  to  processes  with  different  user  identity.  We  use  different  user  identity  to  run  each  crawler 
instance,  extract  network  flow  information  for  each  URL  and  use  the  flow  attributes  to  extract  all  the  network  packets 
of  a  URL.  Because  our  crawler  is  light-weight,  we  conservatively  ran  50  instances  in  our  experiments. 


Input  URLs 

Our  crawler 

Capture-HPC 

Malicious  (1,562) 

4  min 

98  min 

Benign  (1,500) 

4  min 

101  min 

Table  3:  Time  comparison  between  Capture-HPC  and  our  crawler 

The  input  URLs  in  our  performance  experiments  consist  of  1,562  malicious  URLs  that  are  accessible,  and  1,500 
benign  URLs  that  are  the  first  1,500  websites  listed  by  Alexa .  com.  Table  3  shows  the  performance  of  the  two 
systems.  We  observe  that  our  crawler  is  about  25  times  faster  than  Capture-HPC,  which  demonstrates  the  performance 
gain  of  our  system.  We  note  that  in  the  experiments,  our  cross-layer  data  collection  system  actually  collected  all 
124  features.  The  performance  can  be  further  improved  if  only  the  necessary  features  (nine  in  the  above  data-level 
aggregation  method)  are  collected. 

Summary: 

We  demonstrated  that  cross-layer  information  leads  to  better  classifiers.  We  further  demonstrated  that  using  as  few  as 
nine  cross-layer  features,  including  five  application-layer  features  and  four  network-layer  features,  the  resulting  J48 
classifier  is  almost  as  good  as  the  one  that  is  learned  using  all  the  124  features.  We  showed  that  our  data  collection 
system  roughly  can  be  at  least  25  times  faster  than  the  dynamic  approach  based  on  Capture-HPC. 

3  Resilience  Analysis  against  Adaptive  Attacks 

Cyber  attackers  often  adjust  their  attacks  to  evade  the  defense.  In  the  previous  section,  we  demonstrated  that  the  J48 
classifier  is  very  effective.  However,  it  may  be  possible  that  the  J48  classifier  can  be  easily  evaded  by  an  adaptive 
attacker.  In  this  section,  we  partially  resolve  the  issue. 

Because  the  problem  is  fairly  complicated,  we  start  with  the  example  demonstrated  in  Figure  4.  Suppose  that 
the  attacker  knows  the  defender’s  J48  classifier  M.  The  leaves  are  decision  nodes  with  class  0  indicating  benign 
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Figure  4:  Example  manipulation  of  features  to  evade  J48  classifier. 


URL,  which  is  called  benign  decision  node,  and  1  indicating  malicious  URL,  which  is  called  malicious  de¬ 
cision  node.  Given  the  classifier,  it  is  straightforward  to  see  that  a  URL  associated  with  feature  vector  (X4  = 

0.31;  Xg  =  5.3;  Xi6  =  7.9;  2Gg  =  2.1),  is  malicious  because  of  the  decision  path  Vg  — — — >  v\g  — — »•  V4. 
To  evade  detection,  an  adaptive  attacker  can  adjust  the  URL  properties  that  lead  to  feature  vector  (X4  =  0;  Xg  = 
7.3;  Xig  =  7.9;  Xig  =  2.9).  As  a  consequence,  the  URL  will  be  classified  as  benign  because  of  the  decision  path 

X9<13  -X4<0  X9<7  ^16<9.1  -X^i8>2.3  xt  ,  .  TT  i  i  i- 

vo - >  vio - >  vq - >  vs - >  vj - >  V3.  Now  the  questions  are:  How  can  the  attacker  adjust  to 

manipulate  the  feature  vectors?  How  should  the  defender  respond  to  adaptive  attacks?  As  a  first  step  towards  a  system¬ 
atic  study,  in  what  follows  we  will  focus  on  a  class  of  adaptive  attacks  and  countermeasures,  which  are  characterized 
by  three  adaptation  strategies  that  are  elaborated  below. 

3.1  Resilience  Analysis  Methodology 

Resilience  metrics: 

In  order  to  characterize  the  resilience  of  cross-layer  detection  against  adaptive  attacks,  we  compare  the  effect  of  non- 
adaptive  defense  and  adaptive  defense  against  adaptive  attacks.  The  effect  will  be  mainly  illustrated  through  the 
true-positive  rate,  which  intuitively  reflects  the  degree  that  adaptive  attacks  cannot  evade  the  defense.  The  effect  will 
also  be  secondarily  illustrated  through  the  false-positive  rate,  which  reflects  the  overall  quality  of  the  defense.  This  is 
because  a  classifier  can  achieve  high  true-positive  rate  while  suffering  high  false-positive  rate. 

Three  adaptation  strategies: 

Suppose  that  system  time  is  divided  into  epochs  0, 1,2,.. ..  The  time  resolution  of  epochs  (e.g.,  hourly,  weekly,  or 
monthly)  is  an  orthogonal  issue  and  its  full-fledged  investigation  is  left  for  future  work.  At  the  ith  epoch,  the  defender 
may  use  the  collected  data  to  learn  classifiers,  which  are  then  used  to  detect  attacks  at  the  jth  epoch,  where  j  >  i 
(because  the  classifier  learned  from  the  data  collected  at  the  current  epoch  can  only  be  used  to  detect  future  attacks 
at  any  appropriate  time  resolution).  Since  the  attacker  knows  the  data  collected  by  the  defender  and  also  knows  the 
learning  algorithms  used  by  the  defender,  the  attacker  can  build  the  same  classifiers  as  the  ones  the  defender  may 
have  learned.  Given  that  the  attacker  always  acts  one  epoch  ahead  of  the  defender,  the  attacker  always  has  an  edge  in 
evading  the  defender’s  detection.  How  can  we  characterize  this  phenomenon,  and  how  can  we  defend  against  adaptive 
attacks? 

In  order  to  answer  the  above  question,  it  is  sufficient  to  consider  epoch  i.  Let  Dg  be  the  the  cross-layer  data  the 
defender  has  collected.  Let  Mg  be  the  classifier  the  defender  learned  from  the  training  portion  of  Dg.  Because  the 
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(a)  Parallel  adaptation  (b)  Sequential  adaptation  (c)  Full  adaptation 


Figure  5:  Three  adaptation  strategies  (/,  g,  h  represent  three  possibly  randomized  functions  that  are  chosen  by  the  attacker  from 
respective  function  families). 


attacker  knows  essentially  the  same  Mo,  the  attacker  may  correspondingly  adapt  its  activities  in  the  next  epoch,  during 
which  the  defender  will  collect  data  D\.  When  the  defender  applies  M()  to  I)\  in  real-time,  the  defender  may  not  be 
able  to  detect  some  attacks  whose  behaviors  are  intentionally  modified  by  the  attacker  to  bypass  classifier  Mq.  Given 
that  the  defender  knows  that  the  attacker  may  manipulate  its  behavior  in  the  (i  +  l)st  epoch,  how  would  the  defender 
respond?  Clearly,  the  evasion  and  counter-evasion  can  escalate  further  and  further.  While  it  seems  like  a  perfect 
application  of  Game  Theory  to  formulate  a  theoretical  framework,  we  leave  its  full-fledged  formal  study  for  future 
work  because  there  are  some  technical  subtleties.  For  example,  it  is  infeasible  or  even  impossible  to  enumerate  all  the 
possible  manipulations  the  attacker  may  exploit  to  evade  Mq-  As  a  starting  point,  we  here  consider  the  following  three 
strategies  that  we  believe  to  be  representative. 

•  Parallel  adaptation:  This  strategy  is  highlighted  in  Figure  5(a).  Specifically,  given  Do  (the  data  the  defender 
collected)  and  Mq  (the  classifier  the  defender  learned  from  Do),  the  attacker  adjusts  its  behavior  accordingly  so 
that  D\  =  /(/2o.  Mq),  where  /  is  some  appropriately-defined  randomized  function  that  is  chosen  by  the  attacker 
from  some  function  family.  Knowing  what  machine  learning  algorithm  the  defender  may  use,  the  attacker  can 
learn  Mi  from  D\  using  the  same  learning  algorithm.  Because  the  attacker  may  think  that  the  defender  may 
know  about  /,  the  attacker  can  repeatedly  use  /  multiple  times  to  produce  Di  =  /(Mo,  Do)  and  then  learn  Mt 
from  Di,  where  i  =  2,  3, . . ..  Note  that  because  /  is  randomized,  it  is  unlikely  that  Di  =  Dj  for  i  ^  j. 

•  Sequential  adaptation:  This  strategy  is  highlighted  in  Figure  5(b).  Specifically,  given  Do  (the  data  the  defender 

collected)  and  Mo  (the  classifier  the  defender  learned  from  Do),  the  attacker  adjusts  its  behavior  so  that  D  \  = 
g{Do,  Mq),  where  g  is  some  appropriately-defined  randomized  function  that  is  chosen  by  the  attacker  from  some 
function  family,  which  may  be  different  from  the  family  of  functions  from  which  /  is  chosen.  Knowing  what 
machine  learning  algorithm  the  defender  may  use,  the  attacker  can  learn  Mi  from  D\  using  the  same  learning 
algorithm.  Because  the  attacker  may  think  that  the  defender  may  know  about  g,  the  attacker  can  repeatedly  use 
g  multiple  times  to  produce  Di  =  i)  and  then  learn  Mi  from  Di,  where  i  =  1,2,.... 

•  Full  adaptation:  This  strategy  is  highlighted  in  Figure  5(c).  Specifically,  given  Do  and  Mq,  the  attacker  adjusts 

its  behavior  so  that  D\  =  h{Do,  Mq  )  for  some  appropriately-defined  randomized  function  that  is  chosen  by  the 
attacker  from  some  function  family,  which  may  be  different  from  the  families  of  functions  from  which  /  and  g 
are  chosen.  Knowing  what  machine  learning  algorithm  the  defender  may  use,  the  attacker  can  learn  Mi  from 
D\  using  the  same  learning  algorithm.  Because  the  attacker  may  think  that  the  defender  may  know  about  h,  the 
attacker  can  repeatedly  use  h  multiple  times  to  produce  1),  =  ,  Do  - . . . ,  D,  -  \ )  and  then  learn  Mi  from 

Di,  where  i  =  1,2,.... 

Defender’s  strategies  to  cope  with  adaptive  attacks: 

How  should  the  defender  react  to  the  adaptive  attacks?  In  order  to  characterize  the  resilience  of  the  classifiers  against 
adaptive  attacks,  we  need  to  have  real  data,  which  is  impossible  without  participating  in  a  real  attack-defense  escalation 
situation.  This  forces  us  to  use  some  method  to  obtain  synthetic  data.  Specifically,  we  design  functions  f,  g,  and  h  to 
manipulate  the  data  records  corresponding  to  the  malicious  URLs,  while  keeping  intact  the  data  records  corresponding 
to  the  benign  URLs.  Because  f,  g,  and  h  are  naturally  specific  to  the  defender’s  learning  algorithms,  we  here  propose 
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the  following  specific  functions/algorithms  corresponding  to  J48,  which  was  shown  in  the  previous  section  to  be  most 
effective  for  the  defender. 

At  a  high-level.  Algorithm  1  takes  as  input  dataset  Dg  and  adaptation  strategy  ST  £  {/,<?,  h}.  In  our  case  study, 
the  number  of  adaptation  iterations  is  arbitrarily  chosen  as  8.  This  means  that  there  are  9  classifiers  Mo,  Mi, . . . ,  Mg, 
where  M,;  is  learned  from  I)t. 

•  For  parallel  adaptation,  we  consider  the  following  /  function:  Di  consists  of  feature  vectors  in  Dg  that  cor¬ 
respond  to  benign  URLs,  and  the  manipulated  versions  of  the  feature  vectors  in  Do  that  correspond  to  the 
malicious  URLs. 

•  For  sequential  adaptation,  we  consider  the  following  g  function:  I)l+\  consists  of  the  benign  portion  of  Do,  and 
the  manipulated  portion  of  Di  where  the  manipulation  is  conducted  with  respect  to  classifier  M,  . 

•  For  full  adaptation,  we  consider  the  following  h  function:  the  benign  portion  of  ZL+i  is  the  same  as  the  benign 
portion  of  Do,  and  the  manipulated  portion  is  derived  from  Do ,  D\, . . . ,  Di  and  D[,  where  D[  is  obtained  by 
manipulating  Di  with  respect  to  classifier  M,  . 


Algorithm  1  Defender’s  algorithm  main  (.Do,  ST,  a ) 

INPUT:  Do  is  the  input  feature  vectors,  ST  indicates  attack  strategy,  a  is  the  number  of  adaptation  iterations 
OUTPUT:  Mo,  . . . ,  Ma  are  classifiers 

1:  initialize  array  Do,  ■  ■  ■ ,  Da+ 1  where  I),  is  the  feature  vectors  consisting  of  benign  URLs  (dubbed  benFV)  and 
malicious  URLs  (dubbed  malFV) 

2:  for  i= 0  to  a  do 

3:  Mi  ■*—  J48.buildmodel(Di) 

4:  switch 

5:  case  1  ST  =  PARALLEL-ADAPTATION 

6:  Di  <r-  Di.benFV  +  manipulate(Do. malFV,  Mo)  {this  is  one  example  of  function  /} 

7:  case  2  ST  =  SEQUENTIAL-ADAPTATION 

S:  Di+i  •<—  Di.benFV  +  manipulated^. malFV,  Mi)  {this  is  one  example  of  function  g} 

9:  case  3  ST  =  FULL-ADAPTATION 

10:  Di+\.benFV  t—  Do-benFV 

11:  D\  <—  manipulated^. 'malFV,  Mt) 

12:  d+i- malFV  0 

13:  for  j  =  1  to  malFV. size 

14:  d  £r  D0.malFV[j],  ...,  Di.malFV[j],  D[[j] 

15:  Di+i.malFV[j ]  <—  d 

16:  end  for 

17:  {this  is  one  example  of  function  h} 

18:  end  switch 

19:  end  for 

20:  return  Mj(i  =  0, . . . ,  8) 


In  order  to  help  understand  Algorithm  2,  let  us  consider  another  example  in  Figure  4.  For  each  non-leaf  node  v 
associated  to  feature  Xi,  v.ival  is  the  interval  [mini,  condition_value].  For  example,  in  Figure  4,  vo  is  associated 
with  feature  Xg  and  vo’s  condition_value  is  13,  which  means  vq Aval  =  [ming,13].  Feature  vector  (X4  = 

—  1;  Xg  =  5;  Xie  =  5;  Xls  =  0)  will  lead  to  decision  path  vo  Xs>~13>  U10  Xi-0y  vg  A'9~7>  v\,  which  means  that 
the  corresponding  URL  is  classified  as  malicious.  For  feature  Xg,  let  us  denote  its  domain  by  Domain(Xg)  = 
{ming, . . . ,  maxg},  where  ming  (maxg)  is  the  minimum  (maximum)  value  of  Xg.  In  order  to  evade  detection,  the 
attacker  can  manipulate  the  value  of  feature  X9  so  that  v-\  will  not  be  on  the  decision  path.  This  can  be  achieved  by 
assigning  a  random  value  from  interval  (7, 13],  which  is  called  escape  interval  and  can  be  derived  as 

([ming,  maxg]  \  [ming,  7])  |^| [ming,  13]  =  (Domain(Xg)  \  vg.ival)  vg-ival. 

Algorithm  2  is  based  on  the  above  observation  and  aims  to  assign  escape  interval  to  each  malicious  decision  node, 
which  is  then  used  in  Algorithm  3. 

The  basic  idea  underlying  Algorithm  3  is  to  transform  a  feature  vector,  which  corresponds  to  a  malicious  URL, 
to  a  feature  vector  that  will  be  classified  as  benign.  We  use  the  same  example  to  illustrate  how  the  algorithm  works. 


13 


Approved  for  Public  Release;  Distribution  Unlimited:  88ABW-201 1-5874,  04  Nov  201 1 


Algorithm  2  Algorithm  preparation(DT) 
l:  initiate  an  empty  queue  Q 

2:  for  all  v  G  DT  do 

3:  if  v  is  leaf  AND  v  =  “malicious”  then 

4:  append  v  to  queue  Q 

5:  end  if 

6:  end  for 

7:  for  all  v  G  Q  do 

8:  v.featureName  ■<—  v.parent.featureName 

9:  v.ival  •<—  Domain{v. parent)  \  v.escjival  {Domain(X)  is  the  domain  of  feature  X } 

10:  v'  <r-  v.parent 

1 1 :  while  v'  ^  root  do 

12:  if  v'  f  eat ureN ame  =  v .  f  eautreN ame  then 

13:  v.escjival  t—  v' Aval  p|  v.escjival 

14:  end  if 

15:  v'  •<—  v' . parent 

16:  end  while 

17:  end  for 


Considering  feature  vector  (X±  =  1;  Xg  =  5;  Am  =  5;Xig  =  5),  the  adaptive  attacker  can  randomly  choose  a 

value,  say  8,  from  V\ .escjival  and  assign  it  to  Xg.  This  will  make  the  new  decision  path  avoid  v\  but  go  through  its 
sibling  vs  because  the  new  decision  path  becomes 

Xg<13  *4<0  Xg>7  -^16^9.1  Xi8>2.3 

^0  - >  - >  Vq  - >  Vs  - >  V7  - >  ^3- 

The  key  idea  is  the  following:  if  the  new  decision  path  still  reaches  a  malicious  leaf,  the  algorithm  will  recursively 
manipulate  the  feature  values  of  its  parent  by  diverting  to  its  sibling  node. 

Now  let  us  discuss  the  time  complexity  of  the  defender’s  algorithms.  Recall  that  a  is  the  number  of  adaptation 
times.  Suppose  that  the  training  dataset  has  N  feature  vectors  and  each  feature  vector  has  m  features.  According 
to  [38],  the  size  of  decision  tree  is  O(N).  In  Algorithm  2,  lines  2  to  6  have  time  complexity  O(N),  which  is  to 
traverse  the  decision  tree  and  find  malicious  decision  nodes  (i.e.,  leaves  with  malicious  label).  Lines  7  to  17  have 
time  complexity  0(N2),  which  is  to  assign  escape  interval  to  each  malicious  decision  node.  In  Algorithm  1,  line  3 
has  time  complexity  0(Nm2)  [45]  and  lines  13  to  16  have  time  complexity  O(N).  The  overall  time  complexity  is 
0(aN ( m 2  +  N)).  In  practice,  this  time  complexity  is  lightweight  even  on  a  regular  desktop  (at  the  magnitude  of  tens 
of  seconds). 

Evaluation  scenarios: 

Because  there  are  three  aggregation  methods  and  both  the  attacker  and  the  defender  can  take  the  three  adaptation 
strategies,  there  are  3  x  3  x  3  =  27  scenarios.  In  the  present  paper,  for  each  aggregation  method,  we  focus  on  three 
scenarios  that  can  be  characterized  by  the  assumption  that  the  attacker  and  the  defender  will  use  the  same  adaptation 
strategy,  we  will  investigate  all  possible  scenarios  in  the  future.  For  each  scenario,  we  consider  the  following  four 
configurations: 

•  The  attacker  does  not  adapt  but  the  defender  adapts  multiple  times. 

•  The  attacker  adapts  once  but  the  defender  does  not  adapt. 

•  The  attacker  adapts  once  but  the  defender  adapts  multiple  times. 

•  Both  the  attacker  and  the  defender  adapt  multiple  times. 

3.2  Cross-layer  Resilience  Analysis 

Resilience  measurement  through  true-positive  rate: 

Figure  6  plots  the  true-positive  rate  with  respect  to  the  number  of  features  that  are  manipulated  by  an  adaptive  attacker 
in  the  case  of  data-level  aggregation.  We  observe  that  if  the  attacker  is  adaptive  but  the  defender  is  non-adaptive,  then 
most  malicious  URLs  will  not  be  detected  as  we  elaborate  below.  For  parallel  and  sequential  adaptations,  the  true¬ 
positive  rate  of  Mq{D\)  drops  to  0%  when  the  attacker  adapts  its  behavior  by  manipulating  two  features.  Even  in  the 


14 


Approved  for  Public  Release;  Distribution  Unlimited:  88ABW-201 1-5874,  04  Nov  201 1 


Algorithm  3  Algorithm  manipulate!!?,  M)  for  transforming  malicious  feature  vector  to  benign  feature  vector 
INPUT:  D  is  dataset,  M  is  classifier 
OUTPUT:  manipulated  dataset 
1:  DT  <—  M.DT  {  DT  is  the  J48  decision  tree} 

2:  preparation!!?!1) 

3:  for  all  feature  vector  fv  &  D  do 
4:  vi—  DT.root 

5:  while  NOT  ( v  is  leaf  AND  v  ^  “benign”)  do 

6:  if  v  is  leaf  AND  v  =  “ malicious ”  then 

7:  pick  a  value  n  £  v.ival  at  random 

8:  /v.setFeatureValue(v. /eatureiVame,  n) 

9:  v<—  v. sibling 

10:  end  if 

11:  if  v  is  not  leaf  then 

12:  if  fv.featureValue  <  v.featureValue  then 

13:  v  i—  v.leftChild 

14:  else 

15:  v  i—  v.rightChild 

16:  end  if 

17:  end  if 

18:  end  while 

19:  end  for 
20:  return  D 


case  of  full  adaptation  defense,  the  true-positive  rate  of  Mg{Di)  can  drop  to  about  50%  when  the  attacker  adapts  its 
behavior  by  manipulating  two  features.  We  also  observe  that  if  the  attacker  is  not  adaptive  but  the  defender  is  adaptive, 
then  most  malicious  URLs  will  be  detected.  This  is  shown  by  the  curves  corresponding  to  Mg.g(Dg).  Also,  if  both 
the  attacker  and  defender  are  adaptive  and  the  attacker  adapts  its  behavior  a  step  further  than  the  defender,  parallel 
adaptation  and  full  adaptation  defense  can  still  effectively  detect  the  evasion,  but  the  sequential  adaptation  defense 
failed  to  detect  the  evasion  when  the  attacker  manipulated  two  features.  We  can  see  this  from  curves  corresponding  to 
Mg.^(Dg).  True -positive  rate  of  both  parallel  adaptation  and  full  adaptation  defense  changes  little,  while  true -positive 
rate  of  sequential  adaptation  defense  drops  to  zero  when  two  features  are  manipulated.  A  possible  explanation  is 
that  Dg  in  sequential  adaptation  strategy  is  manipulated  from  /?8,  which  is  comparable  to  the  scenario  that  1)\  is 
manipulated  from  Dg.  This  is  also  reflected  on  the  two  overlapping  curves  Mg{D\ )  and  Mg.g(Dg). 


(a)  Parallel  adaptation  (b)  Sequential  adaptation  (c)  Full  adaptation 

Figure  6:  Comparison  of  cross-layer  adaptive  defense  against  adaptive  attacks:  data-level  aggregation. 

Figure  7  plots  the  true-positive  rate  in  the  case  of  AND-aggregation,  which  is  similar  to  the  results  in  the  case  of 
data-level  aggregation.  For  example,  if  the  attacker  is  adaptive  but  the  defender  is  non-adaptive,  most  malicious  URLs 
will  not  be  detected  because  the  true-positive  rate  of  M0(!?i)  becomes  0%  when  the  attacker  manipulates  two  features 
in  the  cases  of  parallel  and  sequential  adaptations. 

Figure  8  plots  the  results  in  the  case  of  OR-aggregation  detection.  We  observe  that  if  the  attacker  is  adaptive  but 
the  defender  is  non-adaptive,  some  malicious  URLs  will  not  be  detected.  This  can  be  seen  from  the  fact  that  the 
true-positive  rate  of  Mg(D\)  drops  4%-8%  when  the  attacker  adapts  its  behavior  by  manipulating  two  features.  Note 
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Figure  7:  Comparison  of  cross-layer  adaptive  defense  against  adaptive  attacks:  AND -aggregation. 


that  this  is  different  from  the  data-level  aggregation  and  AND-aggregation  defense,  where  the  true-positive  rate  drops 
to  0%  when  two  features  are  manipulated.  This  can  be  explained  by  the  fact  that  the  OR-aggregation  has  naturally 
high  true-positive.  For  the  cases  where  attacker  is  non-adaptive  and  defender  is  adaptive,  or  both  attacker  and  defender 
are  adaptive,  we  observe  the  same  phenomenon  as  the  data-level  and  AND-aggregation  defense.  To  summarize,  OR- 
aggregation  adaptive  defense  is  more  resilient  than  data-level  aggregation  and  AND-aggregation  adaptive  defense. 
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Figure  8:  Comparison  of  cross-layer  adaptive  defense  against  adaptive  attacks:  OR-aggregation. 


Resilience  measurement  through  false-positive  rate: 

As  mentioned  above,  evaluation  based  on  true-positive  rate  alone  can  be  misleading.  Table  4  describes  the  true-positive 
rate,  false-negative  rate,  and  false-positive  rate  of  cross-layer  adaptive  defense  against  adaptive  and  non-adaptive 
attacks.  We  have  the  following  observations. 

First,  if  the  attacker  is  not  adaptive  and  the  defender  is  adaptive,  using  more  classifiers  based  on  full-adaptation 
will  achieve  effective  detection.  This  can  be  seen  from  columns  Mq(Dq)  and  Mq.s(Do).  For  parallel  and  sequential 
adaptation  strategies,  data-level  aggregation  and  OR-aggregation  detections  Mq.s,(Dq)  have  higher  false-negative  rate 
than  Mq(Dq),  and  AND-aggregation  adaptive  detection  using  either  Mq.s(Dq)  or  Mq(Dq)  has  the  same  true-positive 
rate.  For  full-adaptation  strategy,  using  either  Mq.s(Dq)  or  Mq(Dq)  for  cross-layer  detection  has  the  same  false¬ 
positive  rate,  which  shows  the  advantage  of  full-adaptation  strategy. 

Second,  if  both  attackers  and  defenders  are  adaptive,  using  a  larger  number  of  adaptive  classifiers  can  significantly 
lower  both  false-negative  and  false-positive  rates.  This  can  be  seen  from  columns  A/() {I)\  )  and  Mo-s(-Di).  For  all 
adaptation  strategies,  using  classifier  Mqs  in  all  three  aggregation  methods  can  achieve  higher  true-positive  rate.  For 
parallel  and  sequential  adaptation  strategies,  data-level  aggregation  and  AND-aggregation  detections  Mq  ( 1)  \ )  failed  to 
detect  the  malicious  websites,  but  OR-aggregation  detection  can  achieve  92.4%  true -positive  rate. 

Third,  OR-aggregation  detection  using  full-adaptation  strategy  can  achieve  the  best  true-positive  rate.  We  can  see 
this  from  vertical  perspective  of  Table  4.  Especially,  if  the  defender  is  adaptive,  whether  the  attacker  is  adaptive  or  not, 
OR-aggregation  can  achieve  a  better  true-positive  rate. 

Are  the  manipulated  features  the  important  ones? 

One  may  expect  that  the  important  features  are  the  ones  that  will  be  manipulated  by  the  attacker  in  attempting  to  evade 
defense.  It  is  somewhat  surprising  to  see  that  this  is  not  necessarily  the  case. 

Figure  9  shows  which  features  are  manipulated  by  the  attacker  so  as  to  bypass  classifier  Mq.  In  order  to  evade  the 
1,467  malicious  URLs  from  the  defense,  our  algorithm  manipulated  a  few  features.  We  observe  that  there  is  no  simple 
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Strategy 

Cross-layer 

M0  (D0) 

M)-8  (A)) 

M0  (D,) 

Mo-sC-Di) 

aggregation  method 

TP 

FN 

FP 

TP 

FN 

FP 

TP 

FN 

FP 

TP 

FN 

FP 

Parallel 

data-level  aggregation 

99.52 

0.48 

0.03 

98.70 

1.30 

0.02 

0.00 

100.00 

0.03 

99.11 

0.89 

0.02 

AND -aggregation 

92.37 

7.63 

0.00 

92.37 

7.63 

0.00 

0.00 

100.00 

0.00 

92.37 

7.63 

0.00 

OR-aggregation 

99.52 

0.48 

0.06 

98.70 

1.30 

0.05 

92.37 

7.63 

0.06 

99.32 

0.68 

0.05 

Sequential 

data-level  aggregation 

99.52 

0.48 

0.03 

98.70 

1.30 

0.00 

0.00 

100.00 

0.03 

98.70 

1.30 

0.00 

AND-aggregation 

92.37 

7.63 

0.00 

92.37 

7.63 

0.00 

0.00 

100.00 

0.00 

92.37 

7.63 

0.00 

OR-aggregation 

99.52 

0.48 

0.06 

98.77 

1.23 

0.03 

92.37 

7.63 

0.06 

98.70 

1.30 

0.03 

Full 

data-level  aggregation 

99.52 

0.48 

0.03 

99.52 

0.48 

0.01 

49.63 

50.37 

0.03 

99.52 

0.48 

0.01 

AND-aggregation 

92.37 

7.63 

0.00 

92.37 

7.63 

0.00 

46.42 

53.58 

0.00 

92.37 

7.63 

0.00 

OR-aggregation 

99.52 

0.48 

0.06 

99.52 

0.48 

0.05 

95.57 

4.43 

0.06 

99.52 

0.48 

0.05 

Table  4:  Adaptive  defense  vs.  (non-)adaptive  attack  using  cross-layer  detection  (TP:  true-positive  rate;  FN:  false-negative  rate;  FP: 
false-positive  rate).  Note  that  TP  +  FN  =1.  Four  features  are  manipulated  in  all  cases. 


Figure  9:  Important  features  vs.  manipulated  features  (red  bar:  feature  importance  according  to  GainRatioAttributeEval; 
blue  bar:  frequency  a  feature  is  manipulated  when  manipulating  the  1,467  malicious  URLs;  the  blue  bars  overlay  some  red  bars). 


correspondence  between  the  most  often  manipulated  features  and  the  most  important  features,  which  were  ranked 
using  the  GainRatioAttributeEval  feature  selection  method  mentioned  in  Section  2.3. 

•  At  the  application  layer,  which  is  relevant  because  of  the  AND-  and  OR-aggregation,  two  features,  Postal_code 
and  Redirect_num,  need  to  be  manipulated  in  order  to  evade  application-layer  classifier  Mq.  These  two  fea¬ 
tures  are  not  very  important  in  terms  of  their  contributions  to  the  classifiers,  but  their  manipulation  allows  the 
attacker  to  evade  detection.  This  phenomenon  tells  us  that  non-important  features  can  also  play  an  important 
role  in  evading  detection.  The  reason  that  only  two  features  need  to  be  manipulated  can  be  attributed  to  the  fact 
that  the  application-layer  decision  tree  is  unbalanced  and  has  short  paths. 

•  At  the  network  layer,  which  is  relevant  because  of  the  AND-  and  OR-aggregation,  four  features  are  manip¬ 
ulated  to  evade  M0.  The  four  features  are:  Dist_remote_TCP_port,  Dist_remote_IP,  Duration, 
Local_app_packets,  which  are  ranked  the  2nd,  3rd,  8th,  9th  in  terms  of  their  importance  to  the  classi¬ 
fier.  However,  they  are  most  often  manipulated  because  they  correspond  to  nodes  that  are  typically  parents  of 
the  leaves  that  indicate  malicious  URLs.  Another  two  features  are  important  features. 

•  At  the  cross  layer,  there  are  only  four  features  that  need  be  manipulated  in  order  to  evade  the  detection  of  cross¬ 
layer  Mo  as  shown  in  Table  4.  Like  the  network-layer  defense,  the  manipulation  of  four  features  will  lead  to  a 
high  evasion  success  rate.  The  four  features  are  the  same  as  the  network-layer  ones  that  are  manipulated. 

The  proceeding  phenomenon,  namely  that  the  manipulated  features  are  not  necessarily  the  important  features,  can  be 
explained  as  follows.  Our  manipulation  algorithm  often  manipulates  the  features  that  are  parents  of  the  leaves,  which 
are  often  less  important  than  the  features  that  are  close  to  the  root. 

What  kinds  of  decision-trees  are  more  resilient? 

From  the  defender’s  perspective,  OR-aggregation  cross-layer  detection  is  better  than  data-level  aggregation  and  AND- 
aggregation  cross-layer  detection,  and  full  adaptation  is  better  than  parallel  and  sequential  adaptations  in  the  investi¬ 
gated  scenarios.  Perhaps  more  importantly,  we  observe  that  from  the  defender’s  perspective,  less  important  features 
are  also  crucial  to  correct  classification.  If  one  wants  to  build  a  resilient  decision  tree,  we  offer  the  following  guideline. 

A  decision  tree  is  more  resilient  against  adaptive  attacks  if  it  is  balanced  and  tall.  This  is  because  a  short 
path  will  make  it  easy  to  evade  the  detection. 

To  justify  these  guidelines,  we  note  that  the  heights  of  the  application-,  network-  and  data-level  aggregation  cross¬ 
layer  decision  trees  are  4,  6  and  6,  respectively.  Our  experiments  showed  that  the  attacker  only  needs  to  manipulate 
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app-layer 

net-layer 

data-level  aggregation 

w/o  feature  selection 

105/2 

19/4 

124/4 

w/  feature  selection 

9/1 

11/3 

9/2 

Table  5:  #  of  manipulated  features  w/  or  w/o  feature  selection  ( a/b :  the  input  J48  classifier  was  learned  from  dataset  of  a  features, 
of  which  b  features  are  manipulated  for  evasion). 


two  features  for  evading  the  application-layer  classifier,  and  to  manipulate  four  features  for  evading  both  the  network- 
layer  and  cross-layer  classifiers.  Moreover,  let  us  look  at  Table  5,  we  observe  the  following.  First,  application- 
layer  detection  with  feature  selection  is  evaded  by  manipulating  the  Whois_netType  feature,  which  is  different 
from  application-layer  detection  without  feature  selection.  Second,  network-layer  detection  with  feature  selection  is 
evaded  by  manipulating  three  features  (i.e..  Duration,  Dist_TCP_port  and  App_packets),  where  the  former 
two  features  are  the  same  as  the  network-layer  detection  without  feature  selection.  Third,  detection  based  on  data- 
level  aggregation  is  evaded  by  manipulating  two  features  (i.e.,  Dist_TCP_port  and  DNS_answer_times),  where 
the  former  is  the  same  as  the  data-level  aggregation  detection  without  feature  selection. 


4  Related  Work 

Both  industry  and  academia  are  actively  seeking  effective  solutions  to  the  problem  of  malicious  websites.  Industry  has 
mainly  offered  their  proprietary  blacklists  of  malicious  websites,  such  as  Google’s  Safe  Browsing  [1].  Researchers 
have  used  Logistic  regression  to  study  phishing  URLs  [21],  but  without  considering  the  issue  of  redirection.  Redirec¬ 
tion  has  been  used  as  indicator  of  web  spams  [8,  37,  54,  44].  Kurt  et  al.  [51]  presented  a  system  for  scalably  detecting 
spam  contents.  Ma  et  al.  [31,  32]  studied  how  to  detect  phishing  and  spams  based  on  URLs  themselves. 

In  terms  of  detecting  malicious  websites  that  may  host  malwares,  Nazario  developed  PhoneyC  to  detect  attacks 
that  exploit  known  vulnerabilities  [34],  Choi  et  al.  [13]  investigated  the  detection  of  malicious  URLs,  and  Canali  et  al. 
[11]  presented  the  design  and  implementation  of  a  static  detection  tool  called  Prophiler.  However,  all  these  studies  did 
not  consider  the  usefulness  of  cross-layer  detection  and  adaptive  attacks.  On  the  other  hand,  the  back-end  system  for 
deeper  analysis  is  also  an  active  research  topic  [15,  12,  55,  33],  because  attackers  have  been  attempting  to  circumvent 
dynamic  analysis  [26,  46]. 

Our  resilience  study  falls  into  the  topic  of  adversarial  machine  learning,  where  the  attacker  aims  to  evade  a  detec¬ 
tion  mechanism  that  is  derived  from  some  machine  learning  method  [7,  52].  In  the  context  of  unsupervised  learning 
for  anomaly  detection,  Perdisci  et  al.  [41]  investigated  how  to  make  the  detection  harder  to  evade.  In  the  context  of 
supervised  learning,  existing  studies  fall  into  two  categories.  In  the  first  category,  the  attacker  can  poison/manipulate 
the  training  data.  Existing  studies  for  tackling  this  problem  include  [40,  36,  48].  In  the  second  category,  the  training 
data  is  not  poisoned.  There  are  two  scenarios.  First,  the  attacker  has  black-box  access  to  the  detection  mechanism 
and  attempts  to  evade  detection  after  making,  ideally,  as  few  as  possible  queries  to  the  detection  mechanism.  This 
setting  has  been  studied  in  [30,  35].  Second,  the  attacker  has  access  to  the  detection  mechanism,  which  is  true  for  our 
resilience  study.  Dalvi  et  al.  [18]  used  Game  Theoretic  method  to  study  this  problem  in  the  setting  of  spam  detection, 
which  is  reasonable  because  a  rational  spammer  certainly  would  take  into  account  the  cost  of  attempting  to  evade 
detection.  However,  this  is  not  necessarily  applicable  to  our  resilience  study  because  the  attacker  does  not  have  to  be 
rational  because  any  successful  evasion  will  cause  successful  attacks.  Moreover,  our  model  actually  gives  the  attacker 
more  freedom  since  the  attacker  knows  the  data  the  defender  collected. 


5  Limitations  and  Future  Work 

Our  method  and  system  are  not  without  limitations.  First,  our  study  is  based  on  given  URLs.  For  systematic  detection 
and  defense,  it  is  ideal  to  crawl  all  domains  and  all  URLs.  Fortunately,  it  is  relatively  easy  to  extend  our  system  from 
this  perspective. 

Second,  our  current  data  collection  system  may  not  be  able  to  deal  with  obfuscated  JavaScript-based  redirection, 
which  is  a  challenging  open  problem  [20].  Although  our  collected  data  hints  that  JavaScript-based  redirection  is 
widely  used  by  malicious  websites,  it  appears  that  JavaScript  obfuscation  may  not  have  been  widely  used  because  our 
system  can  effectively  detect  the  malicious  URLs.  However,  this  is  not  true  in  general.  It  is  perhaps  equally  important 
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to  deal  with  encrypted  web  contents.  Fortunately,  any  progress  in  these  directions  can  be  plugged  into  our  system 
in  a  modular  fashion.  We  plan  to  enhance  our  static  analysis  system  by  incorporating  new  techniques  such  as  those 
described  in  [15,  49,  39,  17]. 

Third,  we  considered  a  certain  class  of  adaptive  attacks  that  attempt  to  evade  detection,  where  the  same  type  of 
adaptation  strategy  is  used  by  both  attacker  and  defender.  Our  adaptive  defense  against  these  adaptive  attacks  was 
shown  to  be  effective.  This  is  based  on  the  assumption  that  the  defender  knows  the  attacker’s  adaptation  strategies, 
which  may  not  be  true  in  general.  Moreover,  there  could  be  other  classes  of  adaptive  attacks  that  may  be  able  to  defeat 
our  defense.  Our  artificial  manipulation  algorithms  for  mimicking  adaptive  attacks  may  or  may  not  be  truthful  because 
when  the  algorithms  changes  the  values  of  some  application-layer  features,  some  network-layer  features  may  have  to 
be  changed  correspondingly  so  as  to  keep  the  cross-layer  semantics  consistent.  This  is  a  non-trivial  problem  because 
there  is  no  simple  mapping  between  the  application-layer  features  and  the  network-layer  features.  Finally,  our  adaptive 
defense  is  reminiscent  of  the  studies  in  robust  classifiers  [10,  27,  22,  28,  9],  which  makes  us  believe  that  our  study  will 
spark  more  investigation  on  resilience,  especially  formal  treatment  and  characterization  on  the  power  and  limitation  of 
our  defense. 

Fourth,  our  study  is  mainly  based  on  the  dataset  we  collected  on  a  specific  day.  During  our  study,  we  also  analyzed 
the  datasets  collected  on  several  consecutive  days,  from  which  we  obtained  similar  results.  Nevertheless,  we  plan  to 
conduct  a  systematic  study  over  datasets  that  are  collected  by  90  consecutive  days.  We  will  report  our  results  in  a 
future  version  of  the  present  paper. 

6  Conclusion 

We  presented  a  novel  approach  for  detecting  malicious  websites  based  on  proactive  static  analysis  of  website  contents. 
Unlike  existing  studies  based  on  application-layer  data  alone,  our  cross-layer  analysis  takes  advantage  of  the  corre¬ 
sponding  network-layer  information.  Moreover,  we  study  the  resilience  of  our  solution  against  adaptive  attacks.  As 
discussed  above,  our  study  brings  out  several  important  and  challenging  research  problems. 
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of  Networked  Systems 
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Abstract.  Quantitative  security  analysis  of  networked  computer  systems  has  been  an 
open  problem  in  computer  security  for  decades.  Recently,  a  promising  approach  was 
proposed  in  [Li  et  al.  11],  which,  however,  made  some  strong  assumptions  including 
the  exponential  distribution  of,  and  the  independence  among,  the  relevant  random 
variables.  In  this  paper,  we  substantially  weaken  these  assumptions  while  offering,  in 
addition  to  the  same  types  of  analytical  results  as  in  [Li  et  al.  11],  methods  for  obtain¬ 
ing  the  desired  security  quantities  in  practice.  Moreover,  we  investigate  the  problem 
from  a  higher-level  abstraction,  which  also  leads  to  both  analytical  results  and  prac¬ 
tical  methods  for  obtaining  the  desired  security  quantities.  These  should  represent  a 
significant  step  toward  ultimately  solving  the  problem  of  quantitative  security  analysis 
of  networked  computer  systems. 


I.  Introduction 

Quantitative  security  analysis  of  networked  computer  systems  from  a  whole- 
system,  rather  than  from  an  individual-component,  perspective  is  one  of  the 
most  important  open  problems  in  cyber  security.  Recently,  a  promising  stochastic 
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process  approach  to  tackling  this  problem  was  proposed  in  [Li  et  al.  11].  In 
this  approach,  a  networked  computer  system  is  modeled  as  a  complex  network, 
or  graph,  G  =  (V,  E),  where  V  is  the  vertex,  or  node,  set  and  E  is  the  edge 
set  over  which  direct  attacks  can  be  carried  out.  Then  a  stochastic  abstraction 
of  the  interactions  between  the  attacker  and  the  defender  taking  place  in  G 
is  considered.  The  main  research  task  is  to  derive  the  following  two  security 
measures: 

•  q(v):  the  probability  that  a  node  v  G  V  is  compromised  (or  attacked)  in  the 
steady  state. 

•  q :  the  probability  that  a  uniformly  chosen  node  in  G  is  compromised  (or 
attacked)  in  the  steady  state. 

The  quantity  q  is  particularly  important  for  decision-making  because  it  captures 
or  reflects  the  global  security  level  of  a  networked  system  in  a  useful  way.  For 
example,  given  q ,  the  defender  can  design  or  deploy  appropriate  cryptographic 
or  security  schemes  to  mitigate  the  damage  of  the  attacks.  (Some  extensions  to 
q  will  be  discussed  in  Section  7.) 

1.1.  Our  Contribution 

In  this  paper,  we  make  substantial  improvements  upon  [Li  et  al.  11],  which 
assumed  that  the  relevant  random  variables  follow  the  exponential  distribution 
and  are  independent  of  each  other.  Specifically,  we  consider  two  cases:  (I)  the 
case  that  the  lower-level  attack-defense  process  is  observed,  which  is  the  same  as 
in  [Li  et  al.  11]  but  with  much  weaker  assumptions;  (II)  the  case  that  the  higher- 
level  alternating  renewal  process,  but  not  the  lower-level  attack-defense  process, 
is  observed.  The  latter  case  was  not  investigated  in  [Li  et  al.  11]  and  may  be 
more  realistic. 

In  the  case  that  the  lower-level  attack-defense  process  is  observed  (Section  4), 
we  present  new  analytical  results  for  deriving  q(v)  under  weaker  assumptions, 
where  q(v)  is  the  probability  that  an  arbitrary  node  v  €  V  is  compromised  in 
the  steady  state.  On  the  one  hand,  these  new  analytical  results  lead  to  useful 
methods  for  obtaining  the  global  security  measure  q  in  practice: 

(i)  In  the  case  that  G  is  a  regular  graph  with  moderate  node  degree  and  the 
dependence  between  the  nodes  can  be  ignored,  we  derive  a  closed-form 
solution  for  q. 
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(ii)  In  the  case  that  G  is  an  arbitrary  graph  (including  the  case  of  a  regular 
graph  but  with  large  node  degree)  and  the  dependence  between  the  nodes 
can  be  ignored,  we  give  an  approximation  solution  for  q. 

(iii)  In  the  case  that  G  is  an  arbitrary  graph  and  the  dependence  between  the 
nodes  cannot  be  ignored,  we  describe  a  statistical  solution  for  q. 

On  the  other  hand,  the  new  analytical  results  lead  to  bounds  of  q  both  in  the 
case  that  G  is  regular  graph  and  in  the  case  that  G  is  an  arbitrary  graph.  Such 
results  are  interesting  and  useful  because  they  demand  much  less  information. 
The  upper  bounds  are  especially  useful  in  decision-making  because  they  capture 
a  sort  of  worst-case  scenario  from  the  defender’s  perspective. 

In  the  case  that  the  lower-level  attack-defense  process  is  not  observed,  but  the 
higher- level  alternating  renewal  process  is  observed  (Section  5),  we  present  new 
analytical  results  for  estimating  q(v).  These  analytical  results  lead  to  statistical 
methods  for  obtaining  the  global  security  measure  q  in  practice.  Note  that  this 
case  was  not  investigated  in  [Li  et  al.  11],  and  is  harder  to  deal  with  analytically 
due  to  the  lack  of  sufficient  lower- level  information. 

Note  that  the  results  in  [Li  et  al.  11]  and  in  the  present  paper  are  applicable  to 
both  undirected  and  directed  graphs.  To  simplify  notation,  we  will  focus  on  the 
case  of  undirected  graphs,  while  noting  that  when  we  adapt  the  results  to  the 
case  of  directed  graphs,  we  need  to  replace  “node  degree”  with  “node  in-degree” 
(a  node  v’s  in-degree  is  the  number  of  nodes  that  point  to  v). 

1.2.  Outline  of  the  Paper 

The  rest  of  the  paper  is  organized  as  follows.  In  Sections  2  and  3,  we  briefly 
review  the  model  and  results  in  [Li  et  al.  11]  and  some  probabilistic  notions, 
respectively.  In  Section  4,  we  present  new  results  for  the  case  in  which  the  lower- 
level  attack-defense  process  is  observed.  In  Section  5,  we  present  new  results 
for  the  case  in  which  the  higher-level  alternating  renewal  process  is  observed. 
In  Section  6,  we  review  related  previous  studies.  In  Section  7,  we  discuss  the 
utility  of  the  stochastic  process  approach  and  future  research  directions  toward 
the  ultimate  goal.  In  Section  8,  we  conclude  the  paper. 

1.3.  Notation 

The  principal  notation  used  throughout  the  paper  is  summarized  as  follows. 

G  =  (V,  E):  This  is  a  graph  that  abstracts  a  networked  system  with  n  =  |  V|  nodes 
of  average  node  degree  /d. 
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deg (v),q(v),K(v):  In  the  steady  state,  a  node  v  G  V  is  compromised  with  prob¬ 
ability  q(v),  and  among  the  v’s  deg(v)  neighbors,  K(v )  neighbors  are 
compromised. 

q,K:  In  the  steady  state,  the  probability  that  a  uniformly  chosen  node  is  com¬ 
promised  is  denoted  by  q ,  and  the  number  of  a  uniformly  chosen  node’s 
compromised  neighbors  is  denoted  by  K,  where  q  =  q(v)/n  and 

K  =  I2ve\/K(v)/n- 

{(Sj(v),Cj(v))\  j  >  1}:  This  is  the  sequence  of  node  u’s  cycles  corresponding 
to  the  alternating  renewal  process,  where  each  cycle  consists  of  the  time 
interval  Sj  ( v )  during  which  v  is  secure  and  the  time  interval  Cj  ( v )  during 
which  v  is  compromised. 


2.  A  Brief  Review  of  the  Model  and  Results  in  [Li  et  al.  1 1] 

We  start  with  a  brief  review  of  [Li  et  al.  11].  As  mentioned  above,  a  networked 
system  is  represented  as  a  complex  network  G  =  (V,  E),  where  V  is  the  set  of  n 
nodes  or  vertices,  namely  |V|  =  n,  that  abstract  (for  example)  computers,  and 
E  is  the  set  of  edges  that  abstract  the  internode  communications  or  interactions 
that  can  be  exploited  to  carry  out  direct  attacks.  At  any  point  in  time,  a  node  v  € 
V  is  either  secure  or  compromised.  These  abstractions  are  not  new.  The  novelty 
of  [Li  et  al.  11]  is  the  attack-defense  process  over  G,  which  is  specified  by  the 
following  random  variables: 

•  X-\ :  The  time  a  secure  node  becomes  compromised  directly  by  the  attacker 
outside  of  G.  This  models  “drive-by  download” -like  attacks  [Provos  et  al.  07]. 

•  The  time  a  secure  node  becomes  compromised  because  of  its  itli 
compromised  neighbor,  1  <  i  <  deg(n).  This  models  attacks  such  as  stan¬ 
dard  malware  spreading  within  a  network. 

•  Y\ :  The  time  a  compromised  node  becomes  secure  again  because  the  com¬ 
promise  has  been  detected  and  cured.  This  models  reactive  defense. 

•  Y‘2‘.  The  time  a  compromised  node  becomes  secure  again  for  any  other  reason 
(e.g.,  patching  or  reinstalling  the  software  system  even  if  the  compromise 
was  not  detected).  This  models  proactive  defense. 

As  mentioned  above,  the  main  security  measure  of  interest  is  q ,  the  probability 
that  a  uniformly  chosen  node  is  compromised  in  the  steady  state.  This  imme¬ 
diately  leads  to  the  more  intuitive  security  measure  q  ■  n,  namely  the  expected 
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number  of  compromised  nodes  in  the  steady  state.  Since  in  the  steady  state  an 
arbitrary  node  v  G  V  is  compromised  with  probability  q(v),  we  have 

f  =  Et,gyg(^) 
n 

The  above  is  perhaps  the  most  important  conceptual  contribution  of  [Li  et  al.  11], 
which  is  very  simple  but  insightful  because  q  captures  the  global  property  of 
G  =  (V,  E)  in  the  presence  of  attacks  and  defenses.  The  main  research  task  is 
to  derive  an  expression  for  q ,  which  turns  out  to  be  a  difficult  problem  because 
the  states  of  the  nodes  are  random  variables  that  are  dependent  on  each  other 
according  to  the  complex  structure  of  G.  As  a  first  step,  [Li  et  al.  11]  focused 
on  the  special  case  in  which  G  is  regular  graph,  namely  all  nodes  have  the  same 
degree  and  the  q{v)’s  are  all  the  same  (i.e.,  independent  of  v).  Based  on  the 
following  assumption,  two  main  results  were  offered  in  [Li  et  al.  11]. 

Assumption  2.1.  [Li  et  al.  11] 

1.  Assumption  on  the  distributions  of  the  random  variables: 

•  Xi  follows  the  exponential  distribution  with  rate  a. 

•  The  X2/s  follow  the  exponential  distribution  with  rate  7. 

•  Yi  follows  the  exponential  distribution  with  rate  (3. 

•  Y2  follows  the  exponential  distribution  with  rate  r/. 

2.  Assumption  on  the  independence  between  the  random  variables:  X±,  Y\, 
Y2,  and  the  X2y’s  are  independent  of  each  other. 


Theorem  2.2.  [Li  et  al.  11]  Suppose  G  is  regular  graph.  Under  Assumption  2.1,  we 
have 


1  +  m 


where  m  =  E 


P  +  V 
y.  +  'yK 


and  the  random  variable  K  indicates  the  number  of  compromised  neighbors  of  a 
node. 


Theorem  2.3.  [Li  et  al.  11]  For  a  regidar  graph  G  of  node  degree  p,  under  Assumption 
2.1,  we  have 

a  a  +  7/z 

-  <  g  <  - . 

a  +  (3  +  77  a  +  (3  +  p  + 
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Simulation  studies  in  [Li  et  al.  11]  showed  that  the  upper  bound  is  reasonably 
tight  not  only  for  regular  graphs,  but  also  for  Erdos-Renyi  random  graphs  and 
power-law  graphs  (in  the  latter  two  cases,  /x  is  naturally  the  average  node  degree) . 
This  means  that  we  can  plausibly  use  the  upper  bound  in  decision-making  when 
there  is  a  lack  of  information. 


3.  A  Brief  Review  of  Some  Probabilistic  Notions  Used  in  This  Paper 

3.1.  Some  Probabilistic  Distributions 

A  random  variable  Z  is  said  to  have  an  exponential  distribution  with  rate  A  >  0 
if  the  distribution  function  is 

F(z)  =  1  —  exp{— Xz}. 

This  distribution  has  the  memoryless  property,  which  played  an  essential  role 
in  [Li  et  al.  11].  The  exponential  distribution  is  a  special  case  of  the  Weibull 
distribution  with  distribution  function 


F(z)  =  l-exp{— (72)“}, 


where  7  >  0  and  a  >  0  are  scale  and  shape  parameters,  respectively. 
The  Lomax  (or  Pareto  II)  distribution  has  distribution  function 


F(z)  =  1  - 


where  /3  >  0  is  the  scale  parameter  and  77  >  0  is  the  shape  parameter  [Johnson 
et  al.  94].  This  distribution  is  interesting  because  it  can  model  heavy-tailed  phe¬ 
nomena. 


3.2.  Some  Notions  of  Nonindependence  between  Random  Variables 

Two  random  variables  Z\  and  Z2  are  said  to  be  positively  quadrant  dependent 
(PQD)  if  for  all  z\ ,  z2  G  1R  , 

P(Z1  >  zi,Z2  >  zi )  >  P(Z1  >  zi)P(Z2  >  z2). 

In  general,  a  family  {Z\,...,Zn}  of  random  variables  is  said  to  be  positively 
upper  orthant  dependent  (PUOD)  if  for  all  z^  G  IR  ,  i  =  1, . . . ,  n, 

n 

P(Z1  >  zi, . . . ,  Zn  >  zn)  >  JJ  P(Zi  >  Zi). 

i=  1 
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A  family  {Zi, . . . ,  Znj  of  random  variables  is  said  to  be  positively  associated 
(PA)  if 

Cov(f(Z1,...,Zn),  g(Z1,...,Zn))>0 

for  all  coordinatewise  nondecreasing  functions  /,  g  on  IR  n  as  long  as  the  co- 
variance  exists.  Finally,  we  recall  the  multivariate  Marshall-Olkin  model,  which 
offers  a  good  deal  of  flexibility  in  modeling  dependence  structures  [Marshall  and 
Olkin  67,  Li  08]. 

Let  {Eb,B  C  {1,2, ...  ,6}}  be  a  sequence  of  independent,  exponentially  dis¬ 
tributed  random  variables,  where  EB  has  mean  \/\B.  Let 

Tj  =  min{ EB  :  j  G  B},  j  =  1, . . . ,  b. 

The  joint  distribution  of  T  =  (Tj, . . .  ,  T&)  is  called  the  Marshall-Olkin  exponen¬ 
tial  distribution  with  parameters  {A b,B  C  {1,2,...  ,b} }. 

It  is  known  [Barlow  and  Proschan  81,  Theorem  3.2,  Chapter  2]  that  PA  => 
PUOD(PQD),  meaning  that  PA  is  (not  necessarily  strictly)  stronger  than  PUOD 
(PQD).  It  is  also  known  that  the  Marshall-Olkin  exponential  distribution  is  PA, 
and  hence  PUOD  [Mueller  and  Stoyan  02]. 


4.  New  Results  for  the  Case  in  Which  the  Attack-Defense  Process  Is  Observed 

In  this  section,  we  present  new  analytical  results  for  deriving  q(v),  methods  for 
deriving  q  in  practice,  and  bounds  of  q  in  the  absence  of  sufficient  information. 

4.1.  New  Analytical  Results  for  Deriving  q(v)  under  Weaker  Assumptions 

We  seek  to  substantially  weaken  Assumption  2.1  while  considering  more  gen¬ 
eral  graph  structures  (i.e. ,  not  just  regular  graphs).  We  start  with  the  following 
assumption. 

Assumption  4.1.  (A  substantially  weaker  assumption.) 

1.  Rather  than  assuming  that  X\ ,  the  X^y’s,  Y\ ,  and  Yj  are  exponential 
random  variables,  we  assume  the  following: 

•  Xi  follows  an  arbitrary  continuous  distribution  Tj  with  finite  mean. 

•  X'i  t  follows  an  arbitrary  continuous  distribution  with  finite  mean, 
where  the  distributions  may  be  different  for  different  f’s. 
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•  Y\  follows  an  arbitrary  continuous  distribution  G\  with  finite  mean. 

•  Y2  follows  an  arbitrary  continuous  distribution  G-2  with  finite  mean. 

2.  Rather  than  assuming  that  Xi,  the  X2ij’s,  Y\ ,  and  Y2  are  independent  of 
each  other,  we  assume  that  X\,  the  X2jj’s,  Y\ ,  and  Y2  can  be  mutually 
dependent. 

In  practice,  Assumption  4.1  can  be  tested  using  some  nonparametric  methods 
(cf.  [Hollander  and  Douglas  99,  Chapter  8]). 


Theorem  4.2.  Denote  the  joint  survival  function  for  {Xi , X2ji ,X2 }  by 
Hn+i(x\, . . .  ,xn+i)  =  P  ( Xi  >  xi,X22  >  x2,  ■  ■  ■  ,X2  n  >  a:?l+i) , 
and  the  joint  survival  function  for  Y\  and  Y2  by 

G(yi,y2)  =  P(Yi  >  yi,Y2  >  y-2) . 


Under  Assumption  j.l,  the  probability  q(v)  that  an  arbitrary  node  v  €  V  is 
compromised  in  the  steady  state  is 


q{v) 


1 

1  +  to’ 


(4.1) 


where 


‘  /*oo 

m  =  E  /  HK(v)+1(x,...,x)dx 

J  o 


I J  G(x ,  x)  dx, 


and  K{v)  is  the  number  of  compromised  neighbors  of  node  v. 


Proof.  Observe  that  in  general,  the  states  of  node  v  £  V  can  be  modeled  as  an 
alternating  renewal  process,  with  each  cycle  composed  of  two  states:  secure  with 
random  time  Sj(v)  and  compromised  with  random  time  Cj(v)  for  j  =  1,2, ... . 
Under  Assumption  4.1,  for  each  cycle  j  we  can  express  (Sj(v),Cj(v))  as 

Sj(v)  =  S(v)  =  min  {X2 ,  {X2^;  1  <  i  <  K(v)}}  ,  Cj(v)  =  C(v)  =  min{Yi,  y2}  , 
where  K(y)  is  the  number  of  v’s  compromised  neighbors.  Hence, 

E [S(v)]  =  E[mm{X1 ,  X2A , . . . ,  X2iK{v) }] 

=  E  [E  [min{X’1,X2jl, . . .  ,X2tK(v)}  I  A»]] 

■  POO 

—  E  /  HK(v)+i(x,...,x)dx  . 

.Jo 

Similarly,  the  time  that  a  node  is  compromised  in  a  cycle  is 

POO 

E[C(u)]  =  E[min{Yi,  y2}[  =  /  G(x,x)dx. 

Jo 
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Note  also  that  Assumption  4.1  implies  that 

E[<S(u)  +  C(v)\  <  oo. 

This  and  the  fact  that  the  distribution  of  S(v)  +  C(v)  is  nonlattice  guarantee 
(cf.  [Ross  96,  Theorem  3.4.4])  that  the  process  is  steady,  and  moreover, 

=  E\C(v)\  =  /q°°  G(x,  x)  dx  _ 

q  E[s(v)]  +  Elc(v)]  E  [/0°°  HK(v)+i{x,  . . .  ,x)  dx]  +  /0°°  G(x,x)  dx 

The  desired  result  follows.  □ 

The  following  corollary  relates  Theorem  2.2  to  Theorem  4.2. 


Corollary  4.3.  Theorem  2.2  is  a  special  case  of  Theorem  f.2. 


Proof.  By  replacing  Assumption  4.1  with  Assumption  2.1  and  letting  G  be  a  regular 
graph  in  Theorem  4.2,  we  obtain 


to  =  E 


=  E 


HK(v\+ i(x,  . . . ,  x)  dx  dx 


;  K(V) 

Fi(x)  H  F2,i(x)dx 
2=1 


G(x,  x) dx 


Gi(x)G2  (x)  dx 


=  E 
=  E 


/*oo 

/  exp{— ax}  exp{— K(v)  ■  7] -dx 

Jo 


P  +  V 

a  +  7  K(v) 


I J  exp{— (3x}  exp{— rjx}  dx 


Therefore,  the  resulting  q,  namely  q(v)  in  the  case  of  regular  graphs,  is  the  same 
as  in  Theorem  2.2.  □ 


Now  we  consider  the  case  that  the  random  variables  are  independent  of  each 
other  but  follow  the  Weibull  distribution  (rather  than  the  exponential  distribu¬ 
tion  as  required  in  [Li  et  al.  11]). 


Assumption  4.4.  (Weaker  than  Assumption  2.1  but  stronger  than  Assumption  4.1.) 

1.  Assumption  on  the  distributions  of  the  random  variables: 

•  Xi  follows  the  Weibull  distribution  with  survival  function  Fi(t)  = 

g— (Ai *)"  _ 

•  The  Wu’s  follow  the  Weibull  distribution  with  survival  function 

F2(t)  =e-(A2t)“. 
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•  Y\  follows  the  Weibull  distribution  with  survival  function  Gi(t)  = 

g-tiit)'3  _ 

•  Y-2  follows  the  Weibull  distribution  with  survival  function  G2O)  = 

g-(72t)/i. 

2.  Xi,  the  X^j’s,  Yi ,  and  Yi  are  independent  of  each  other  (as  in  Assump¬ 
tion  2.1). 

In  practice,  the  Weibull  distribution  in  Assumption  4.4  can  be  tested  using 
standard  statistical  techniques  such  as  QQ-plot  and  the  Kolmogorov-Smirnov 
test  (cf.  [Lehmann  and  Romano  05,  Chapter  14]).  The  following  result  is  more 
general  than  Theorem  2.2  but  less  general  than  Theorem  4.2. 

Theorem  4.5.  Replacing  Assumption  ^.1  in  Theorem  J^.2  with  Assumption  we 
have 

m  rq  +  bJ  tlAJrT 

r(i  +  i)  [(A?  +  a».a;)1,“ 

Proof.  Note  that 

roo  roo 

/  G\{x)G2{x)  dx  =  /  exp  {- [(qi®)'3  +  {^x)13}  }  dx 

Jo  Jo 

=  j  exp  |  —  qf  +  72  xP 1  dx 

~~wrrr^' 

where  T(-)  is  the  gamma  function.  Note  also  that 

roo  1  r  roo 

E  /  Fi(x)Fr f^v\x)dx  =E  /  exp  {— [(\ix)a  K(v)  •  (\2x)a]}  dx 
Jo  J  Uo 

roo 

=  E  /  exp{—  [(Ay  +  K(v)  -  X 
Jo 

=  r  (X  +  a)  E  A]*  +  I< (v)  ■  Af 
Using  Theorem  4.2,  we  have  the  desired  result.  □ 
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4.2.  Applying  the  Analytical  Results  to  Obtain  q  in  Practice 

Now  we  show  how  to  use  the  above  analytical  results  to  derive  methods  for 
obtaining  q.  While  allowing  Xx ,  the  X2/s,  L) ,  and  Y2  to  be  possibly  dependent 
on  each  other,  we  further  consider  the  (in) dependence  between  the  states  of  a 
node’s  neighbors.  This  leads  to  three  cases: 

(i)  In  the  case  that  G  is  a  regular  graph  of  node  degree  /i  and  the  dependence 
between  neighbors  can  be  ignored,  we  offer  closed-form  results  in  Section 
4.2.1. 

(ii)  In  the  case  that  G  is  an  arbitrary  graph  and  the  dependence  between  neigh¬ 
bors  can  be  ignored,  we  offer  approximation  results  in  Section  4.2.2. 

(iii)  In  the  case  that  G  is  an  arbitrary  graph  and  the  dependence  between  the 
neighbors  cannot  be  ignored,  we  offer  statistical  results  in  Section  4.2.3. 


4.2.1.  Regular  Graphs  with  Ignorable  Dependence  between  Nodes.  In  case  (i),  every  node  has 
fi  neighbors,  denoted  by  1,2 which  are  compromised  in  the  steady  state 
with  probability  q\  =  q2  =  •  •  •  =  =  q.  Let  be  a  Bernoulli  random  variable 

with  parameter  q,  which  is  the  probability  that  the  zth  neighbor  is  compromised. 
Then  K ( v )  =  i  2)  is  a  binomial  distribution  with  parameter  q ,  and 

P(K(V)  =  k)  =  Q  qk  (1  -  qr~k ,  k  =  0,  .  .  .  ,  fi. 

According  to  Theorem  4.2,  which  requires  Assumption  4.1,  we  have 


m  =  E 


Hk(v)+i(x,...,x)  dx 


G{x , x )  dx 


=  E 

k  =  0 


/Mjfe/ 


qk(l-qY  k  I  Hk+i(x,...,x)dx 


G(x,  x)  dx. 


Hence,  we  have 


E 

k= 0 


/M  fc+i 


(1-9) 


H~  k 


Hk+1(x, . . .  ,a :)da 


G(x,  x)  dx  =  1. 


_  (4-2) 

In  order  to  solve  (4.2),  we  need  to  know  the  specific  Hk+\(x, . . . , x)  and  G(x,  x) 
that  can  be  obtained  in  scenarios  such  as  the  following: 
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Closed- form  result  under  Assumption  2.1,  which  is  stronger 
tion  4.1:  In  this  special  case,  we  have  Theorem  2.2  and  thus 

(zy«a-.r*  =  i. 


k  =  0 


Closed- form  result  under  Assumption  4.4,  which  is  stronger 
tion  4.1.  In  this  special  case,  we  have  Theorem  4.5,  and  thus 


«+E 


k= 0 


A*  )  „k+ _  q'jV-k  ^(^o) 


(1  +  ?) 


(Wj 

(A"  +  k\%) 


i  IP 


l/o 


than  Assump- 
(4.2)  becomes 

(4.3) 

than  Assump- 
(4.2)  becomes 

=  1-  (4-4) 


When  /j,  is  not  large,  the  solutions  for  q  in  (4.3)  and  (4.4)  can  be  derived  using  an 
appropriate  iterative  algorithm,  such  as  the  Newton-Raphson  method  [Stoer  and 
Bulirsch  02].  When  [i  is  large,  it  might  be  infeasible  to  derive  explicit  solutions 
for  q,  but  we  can  instead  use  the  approximation  result  in  Section  4.2.2. 


4.2.2.  Arbitrary  Graphs  with  Ignorable  Dependence  between  the  Nodes.  For  an  arbitrary  node 
t)£V,  let  I,  be  a  Bernoulli  random  variable  with  parameter  g,,  which  is  the 
probability  that  the  2th  neighbor  of  node  v  is  compromised.  Note  that  K(y)  = 
^deg(iO  Suppose  the  mean  and  variance  of  K(y)  exist  and  we  can  estimate 
them  (e.g.,  based  on  sufficiently  many  observations  on  v’s  neighbors).  Set 

k  =  E[/sT(u)],  er2  =  Var  (K(y)). 


In  this  case,  we  are  unable  to  give  closed-form  solutions,  and  instead  propose 
using  two  approximations: 

(a)  Normal  approximation:  If  a2  — »  oo  as  deg(u)  — »  oo,  by  Lindeberg’s  theorem 
(cf.  [Billingsley  95,  p.  359]), 

According  to  Theorem  4.2,  which  requires  Assumption  4.1,  we  have 


?n  =  E  Hk(v)+i{x,.  . .  ,x)dx 

ndeg(w)  i 

, —  exp 
V27TCT 


G(x,  x)  dx 


(, k  -  k) 
2a2 


/r°°- 

x  i/fc+i  (x, . . . ,  x)  dk  dx  /  /  G(x,x)dx. 


(4.5) 
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As  a  result,  we  can  obtain  more  specific  approximation  solutions  as  follows: 


•  Approximation  result  under  Assumption  2.1,  which  is  stronger  than  As¬ 
sumption  4.1.  In  this  special  case  we  have  Theorem  2.2,  and  thus  (4.5) 
becomes 

P  +  TI  [deg{v)  1 


V/27t< 


7RT  Jo 


a  +  7# 


exp 


(k  -  k)2 

2<j2 


dk. 


•  Approximation  result  under  Assumption  4.4,  which  is  stronger  than  As¬ 
sumption  4.1.  In  this  special  case,  we  have  Theorem  4.5,  and  thus  (4.5) 
becomes 


TO 


r(i  +  ^)(7f  +  7f)1//3 

\/27roT(l  +  i) 

i'deg(v)  1 


/  0  (Ay  +  k\%) 


1A 


exp 


(fc  -  k)2 
2^~ 


dk. 


(b)  Poisson  approximation.  Note  that  k  >  0.  If  deg(u)  — >  oo  and 

max{gi, . . .  ,gdeg(w)}  0, 

then  [Billingsley  95,  Theorem  23.2]  says  that 

(U\i 

P(K(v )  =  i)  «  exp  {— fc}  -rp,  i  =  0,1, . . .  ,deg(t>). 
According  to  Theorem  4.2,  which  requires  Assumption  4.1,  we  have 


to  =  E 


Uo 

deg(?; 


HK<v)+i{x,...,x)dx 


G(x,  x) dx 


-  (kY  f°°  -  /  r°° 

exp{—  fc}  — j—  Hi+i  (x, . . . ,  x)  dx  j  J  G(x,x)dx.  (4.6) 

As  a  result,  we  can  have  more  specific  approximation  solutions  as  follows: 

•  Approximation  result  under  Assumption  2.1,  which  is  stronger  than  As¬ 
sumption  4.1.  In  this  special  case,  we  have  Theorem  2.2,  and  thus  (4.6) 
becomes 


deg(u) 


to  ss  (/3  +  77)  exp  {— fc} 


i= 0 


1  (W 

a  +  7*  i! 


•  Approximation  result  under  Assumption  4.4,  which  is  stronger  than 
Assumption  4.1.  In  this  special  case,  we  have  Theorem  4.5,  and  thus 
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(4.6)  becomes 


TO 


r(i  +  Pbf  +  T?) 
r(i  +  b 


1/ P  deg(u) 

—  E 

*=0 


( k)1  exp  {— k} 

(A“  +  i  ■  Aj  )1/a  i\ 


In  both  normal  approximation  and  Poisson  approximation,  having  obtained  to 
allows  one  to  derive  q(y)  according  to  (4.1)  in  Theorem  4.2.  Note  that  this 
method  is  sound  because  both  Assumption  2.1  and  Assumption  4.4  are  stronger 
than  Assumption  4.1,  which  underlies  Theorem  4.2.  From  a  sufficiently  large 
sample  V'  C  V  with  each  v  €  V'  being  uniformly  chosen  from  V,  we  can  derive 

=  Hvev><l(v)/ |v'|. 


4.2.3.  Arbitrary  Graphs  with  Nonignorable  Dependence  between  the  Nodes.  If  the  dependence 
between  neighbors  cannot  be  ignored,  the  probability  mass  function  of  K{v)  may 
be  estimated  from  data  as  follows.  Suppose  we  have  observations  (fci,  k2,  ■  •  • ,  h) 
on  K(v),  where  b  is  the  number  of  observations  on  the  number  of  v’s  compromised 
neighbors  in  the  steady  state  (i.e.,  at  b  different  points  of  observation  time).  Then 
the  probability  mass  function  of  K{v)  can  be  estimated  as 

Pk  =  P(K(v)  =k)=  ^~'^1 - — ,  b  ->  00, 

b 

where  I(-)  is  the  indicator  function.  Thus,  according  to  Theorem  4.2,  we  have 


m  =  E 


P k (v)+i(*L  •  ■  •  1  *^)  dx 


G(x,  x ) dx 


deg  (r* )  poQ  j  />  00 

=  y ^  Pk  I  Hk+i{x,  ■  ■  ■  ,x)  dx  /  /  G(x,x)dx.  (4.7) 

k—0  J°  '  J° 


To  be  specific,  let  us  consider  a  concrete  example  under  the  following  assumption: 


Assumption  4.6.  (Weaker  than  Assumption  2.1  but  stronger  than  Assumption  4.1.) 

1.  Assumption  on  the  distributions  of  the  random  variables: 

•  Xi  follows  the  exponential  distribution  with  parameter  A. 

•  The  X2;l’s  follow  the  multivariate  Marshall-Olkin  exponential  model. 

•  Y\  and  Y2  follow  the  bivariate  Marshall-Olkin  exponential  model. 

2.  The  X2/s  can  be  dependent  on  each  other;  Y\  and  Y2  can  be  dependent 
on  each  other. 
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Under  Assumption  4.6,  the  joint  survival  function  of  (X2i, . . . ,  X2  k )  is 
Hk(x  i,  ...,xk)  =  P{X  2ji  >*!,..  .,X2lk  >  xk) 

=  exp  <  -  ^  A iXi  -  ^  AtJ  max^j,^}  -  Ai2...*  max{ii, . . . ,  a:* }  >. 

^  i=l  i<j  ' 


Therefore, 


/  Hk+i  (x, . . . ,  x)dx  =  /  exp{— At}  exp{— kt}  dt 

I  o  Jo 

1 

~  A  +  <j)(X,  k)  ’ 


where 


<t>( A,  fc)  —  A*  +  Ajj  +  •  •  •  +  Ai2...fc  - 

2—1  i<j 

Similarly,  the  joint  survival  function  of  (Y-i .  Y2)  is 

G{yi,y2)  =  P{Yi  >  y\,Y2  >  y2) 

=  exp  {— 71  t/i  -  722/2  -  7i2  max{yi, y2}}  . 

Then 

1 


Using  (4.7),  we  have 


m 


/  G(a:,  x)  dx  =  — 

/o  7i  +  72  +  7i2 


deg(v) 

=  (71 + 72 + 712)  X  xrlfetr 


Having  obtained  in,  one  can  derive  q(y)  according  to  (4.1)  in  Theorem  4.2.  Note 
that  this  method  is  sound  because  Assumption  4.6  is  stronger  than  Assumption 
4.1,  which  underlies  Theorem  4.2.  From  a  sufficiently  large  sample  V'  C  V  with 
each  d£V'  being  uniformly  chosen  from  V,  we  can  derive  q  =  YlveV  9(1')/I^T 


4.3.  New  Bounds  on  q 

Above,  we  discussed  how  to  obtain  q  in  various  settings.  When  one  cannot  obtain 
q  (for  example,  due  to  the  lack  of  sufficient  information),  one  can  instead  use 
the  bounds  of  q  in  decision-making.  This  is  plausible  as  long  as  the  bounds 
are  reasonably  tight,  and  reasonable  because  it  requires  much  less  information 
to  compute  the  bounds.  In  particular,  one  could  use  the  upper  bound  of  q  in 
decision-making  because  it  can  be  seen  as  a  sort  of  worst-case  scenario.  In  what 
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follows  we  will  differentiate  the  case  of  regular  graphs  from  the  case  of  arbitrary 
graphs  under  the  following  three  assumptions: 

Assumption  4.7.  (Assumption  underlying  our  general  bounds  of  q.) 

1.  Assumption  on  the  distributions  of  the  random  variables: 

•  Xi  follows  an  arbitrary  continuous  distribution  with  survival  function 
F\  and  finite  mean. 

•  The  X2,i’s  are  PUOD  with  the  same  marginal  continuous  survival 
function  F '2  and  finite  mean. 

•  Y\  follows  an  arbitrary  distribution  with  continuous  survival  function 
G 1  and  finite  mean. 

•  Y-2  follows  an  arbitrary  distribution  with  continuous  survival  function 
G2  and  finite  mean. 

2.  X\  and  the  X^y’s  are  independent,  and  Y\  and  Y2  may  be  dependent  on 
each  other. 

Assumption  4.8.  (Assumption  underlying  one  specific  bound  of  q.) 

1.  Assumption  on  the  distributions  of  the  random  variables: 

•  Xi  follows  the  exponential  distribution  with  rate  a. 

•  The  X2/S  are  PUOD  with  the  same  marginal  distribution  function 
F2  of  the  exponential  distribution  with  rate  7. 

•  Y-\  follows  the  exponential  distribution  with  rate  /3. 

•  Y2  follows  the  exponential  distribution  with  rate  77. 

2.  X\  and  the  X2,i ’s  are  independent,  and  Y]  and  Y2  are  independent  of 
each  other. 

Assumption  4.9.  (Assumption  underlying  another  specific  bound  of  q.) 

1.  Assumption  on  the  distributions  of  the  random  variables: 

•  Xi  follows  the  Lomax  distribution  with  scale  parameter  A  and  shape 
parameter  a.\  >  1. 

•  The  X2/s  are  PUOD  with  the  same  marginal  distribution  function  F2 
of  the  Lomax  distribution  with  scale  parameter  A  and  shape  parameter 

OL2  >  1. 
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•  Y\  follows  the  Lomax  distribution  with  scale  parameter  7  and  shape 
parameter  f3i  >  1. 

•  Y2  follows  the  Lomax  distribution  with  scale  parameter  7  and  shape 
parameter  /?2  >  1- 

2.  X\  and  the  X^j’s  are  independent,  and  Y\  and  Y2  are  independent  of 
each  other. 


In  practice,  the  PUOD  dependence  assumption  can  be  tested  by  the  method 
suggested  in  [Scaillet  05],  while  the  exponential  distribution  assumption  and  the 
Lomax  distribution  assumption  can  be  tested  using  QQ-plot  or  the  Kolmogorov- 
Smirnov  test  (see  [Lehmann  and  Romano  05]). 


4.3.1.  Bounds  on  q  in  the  Case  of  Regular  Graphs.  We  have  the  following  theorem. 

Theorem  4.10.  Suppose  G  is  regular  graph  of  node  degree  p.  Under  Assumption  f.7, 
in  the  steady  state  we  have 


/»oo  /  /*oo 

J  G(x,  x)dx  J  [.Fi  ( x )  +  G(x ,  x)\  dx 


<  q  <  /  G(x ,  x)dz 


Fi  (x)F%  (a:)  +  G(x,  x)  dx, 


where  k  is  the  expected  number  of  compromised  neighbors  for  an  arbitrary  node. 

Proof.  Denote  by  1, 2, . . . ,  p  the  neighbors  of  an  arbitrary  node  o€V.  Let  be  the 
indicator  function  that  I,  =  1  if  and  only  if  the  itli  neighbor  of  v  is  compromised, 
and  let  qi  be  the  probability  that  the  z'tli  neighbor  is  compromised.  Under  As¬ 
sumption  4.7,  the  system  will  enter  the  steady  state  (see  [Ross  96,  Theorem 
3.4.4]).  Since  G  is  regular  graph,  we  have  qi  =  •  •  •  =  =  q.  In  the  steady  state, 

we  have 


k  =  E[K  (n)]  =  E 


=  qs ■ 
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Assume  that  the  X^j’s  are  PUOD  with  the  same  marginal  survival  function 


=  E  J  F\  (x)Hx („)  (x, . . . ,  x)  dx  j  J  G(x,x)dx 

‘  poo  “I  /  poo 

>E  J  Fi(x)F^l'\x)  dx  /  J  G(x,x)dx 

POO  I  POO 

>  J  Fi(x)F2^K^  (x)  dx  /  J  G(x,x)dx 


=  J  Fi(x)F2(x)dx  j  J  G(x,x)dx,  (4.8) 

where  the  last  inequality  follows  from  Jensen’s  inequality.  Using  (4.1)  in  Theorem 
4.2,  one  can  obtain  the  upper  bound  of  q. 

On  the  other  hand,  we  observe  that  for  every  x  >  0,  we  have 

HK(v){x,  ■  ■  -,x)dx  <  1. 

The  upper  bound  of  m  can  be  derived  as 

P  00  j  POO 

m<J  Fi(x)  dx  I  J  G(x,x)dx.  (4-9) 

Using  (4.1)  in  Theorem  4.2,  one  can  obtain  the  lower  bound  of  q.  □ 

Theorem  4.11.  Suppose  G  is  a  regidar  graph  of  node  degree  p.  Under  Assumption 
f.8,  in  the  steady  state  we  have 

a  — (/3  +  p  +  a  -  /ry)  +  y^/3  +  ry  +  a  —  py)2  +  4pya 


a  +  (3  +  7 


<<?< 


Proof.  From  (4.8),  it  follows  that 


a  +  qpy 


Using  (4.1)  in  Theorem  4.2,  we  have 


This  leads  to 


Hence 


1  1 

Q  ~  m.+  l  ~  P+i  +l 

a  +  qfi^f 


q2 py  +  q{/3  +  r]  +  a  —  py)  —  a  <  0. 


—  (/3  +  p  +  a  —  py)  +  +  p  +  a  -  py)2  +  Apya 
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On  the  other  hand,  according  to  (4.9)  and  (4.1),  one  can  derive 


a  +  0  +  7 


Theorem  4.12.  Suppose  G  is  a  regular  graph  of  node  degree  p.  Under  Assumption 
4-9,  in  the  steady  state  we  have 

_ 7(<*i  -  1) _  <  <  -B  +  sjB 2  +472a2(ai  -  l)p 

A(/3i  +02-1)  +  7(ai  -  1)  ~  q  ~  2a2^p 

where  B  =  \{0\  +  02  —  1)  +  7(«i  —  1)  —  ')CX2p- 

Proof.  Note  that  Ft  has  Lomax  survival  function 

Fi( x)  =  (l+^)  ,  Oi  >  1,  i  =  1,2. 

Similarly,  G,  has  Lomax  survival  function 

Gi{x)=(^l+^j  ,  0i>  1,  2  =  1,2. 

From  (4.8),  it  follows  that 

A  /  7  A(/3i  +02-1) 

TYl  ^  -  /  -  =  - . 

-  a\  +  o.2pq  -  1/  01+02  —  1  7(ai  +  Ci2pq  -  1) 

Using  (4.1)  in  Theorem  4.2,  we  have 

7(«i  +  a2pq  -  1) 


This  leads  to 


where 


A(/?i  +  /32  -  1)  +  7(«i  +  «2W  -  1) ' 


q2a.2r)p  +  qB  -  7(0:1  —  1)  <  0, 


B  =  A(/?i  +02—1)  +  7(cei  -  1)  -  7tt2/u. 


Hence,  the  upper  bound  of  q  is 

<  —B  +  sjBS  +  Arf1 0.2  (oil  -  l)p 
2a27  lL 

On  the  other  hand,  according  to  (4.9)  and  (4.1),  we  have 

^  A(/?i  +02  —  1) 
m  ~  7(ai  -  1)  ' 
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So  the  lower  bound  of  q  is 


q> 


_ 7(«i  ~  1) _ 

A(/?i  +  /32  -  1)  +  t(«i  -  1) ' 


□ 


4.3.2.  Bounds  of  q  in  the  Case  of  Arbitrary  Graphs.  For  an  arbitrary  node  v  £  V,  it  follows 
that 

deg(v) 

K{v)  =  Y  Iu 

*=i 

where  I,  is  a  Bernoulli  random  variable  with  parameter  g,,  which  is  the  prob¬ 
ability  that  the  2th  neighbor  of  v  is  compromised.  Note  that  the  Id s  may  be 
dependent  on  each  other  and  the  qC s  may  be  different.  Note  also  that  in  this 
case,  we  cannot  obtain  a  counterpart  of  Theorem  4.10,  because  the  concavity 
needed  to  apply  Jensen’s  inequality  may  not  hold.  As  such,  we  are  able  to  give 
only  the  following  specific  results  in  this  more  general  case. 

Theorem  4.13.  Suppose  G  is  an  arbitrary  graph  with  average  node  degree  p.  Under 
Assumption  4-8,  in  the  steady  state  we  have 

a  a  +  7/r 

-  <  g  <  - . 

a  +  /3  +  7  a  +  /3  +  p  +  -yp 


Proof.  In  a  fashion  similar  to  the  proof  of  Theorem  4.10,  we  can  show  that  for  an 
arbitrary  node  v, 

a  .  .  a  +  7E[Iv  (n)] 

a  +  /3  +  7  -  q  ~  a  +  P  +  r]  + 'yE[I<(v)\ 

Since 

n_  J2ve  v  gfo) 

n 


it  follows  that 


>  01 
1  ~  a  +  P  +  7 

On  the  other  hand,  because 


deg(u) 


i—  1 


deg(n) 

=  Y  qi  -  cle§('u)’ 

i=  1 


E[K(v)]  =  E 
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we  have 


Hence 


Therefore, 


Because 


a  +  7E[iL(u)] 


a  +  /?  +  rj  +  7E[iL(v)]  a  +  /3  +  7  +  7 deg(v) ' 
a  +  7  deg(u) 


a  +  7deg(i>) 


q(v)  < 


<-£ 


a  +  /3  +  rj  +  7deg(u) 
a  +  7deg(u) 


n  a  +  (3  +  r]  +  7  deg(u) 


,  /  x  a  +  7a; 

/i(x)  =  - - - ,  x  >  0, 

a  +  /3  +  rj  +  jx 

is  a  concave  function,  Jensen’s  inequality  yields 

ly  a  +  7  deg(n)  ^  a  +  7^  Euev  des(v) 

n  E 

wGV 

Hence,  by  (4.10),  we  have 


n-a  +  /Hr,  +  7degW  ''  a  +  (3  +  rj  +  7^  £„eV  deg(n)  ‘ 


g< 


ce  +  7/i 


a  +  /?  +  7  +  7/r 

This  completes  the  proof. 

In  a  similar  fashion,  we  can  prove  the  following  result. 


(4.10) 


□ 


Theorem  4.14.  Suppose  G  is  an  arbitrary  graph  with  average  degree  p.  Under  As¬ 
sumption  4-9,  in  the  steady  state  we  have 

_ 7(07  ~  1) _  <  <  _ 7(0:1  +  a2p  -  1) _ 

A(/?i  +  (32-  1)  +  7(0:1  -  1)  ~q  ~  A(/?i  +  /32  -  1)  +  7(01  +  a2p  -  1) ' 


5.  New  Results  for  the  Case  in  Which  Only  the  Alternating  Renewal  Process  Is 
Observed 

The  results  obtained  thus  far  are  applicable  when  the  lower-level  attack-defense 
process  is  observed,  or  more  specifically  the  distributions  F\ ,  the  F2/s,  G\, 
and  G2  and  their  dependence  structure  (if  applicable)  are  observed.  What  if 
they  are  not  observed?  This  is  a  very  reasonable  question,  because  obtaining  all 
detailed  information  can  be  very  costly,  and  it  may  be  much  simpler  to  observe 
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the  higher-level  alternating  renewal  process  {(Sj(v),  Cj(v));  j  >  1}  mentioned 
above.  In  this  case,  neither  the  results  in  [Li  et  al.  11]  nor  the  results  described 
above  are  applicable,  which  motivates  us  to  present  the  following  results  for  this 
possibly  more  realistic  scenario. 

5.1.  Analytical  Results  for  Estimating  q(v) 

Let  pv (t)  be  the  probability  that  node  v  is  secure  at  time  t,  and  let  the  limiting 
average  probability  be 

1  f* 

p(v)  =  lim  -  /  pv(s)  ds. 
t_>0°  t  J  o 

For  node  v  G  V  with  {(Sj(v),Cj(v))  :  1  <  j  <b},  define 
Wb(v)  =  Si(v)  +  S2(v)  +  ■■■  +  Sb(v),  Db(v)  =  Ci(y)  +  C2{v)  +  •  •  •  +  Cb(v). 


Lemma  5.1.  [Marlow  and  Tortorella  95,  Theorem  1]  If 

V  Wb(v)  Db(v) 

lim  — - —  =  w[y)  a.s.  and  lim  — - —  —  d[v)  a.s 

6— >oo  b  6— xx)  b 

where  “a.s.”  means  “almost  surely”  and  0  <  w(v),d(v)  <  oo,  then 

w(v) 


P(v)  = 


w(v)  +  d(v) 


Assumption  5.2.  Assume  that  {Sj  (v)  :  j  >  1}  and  {Cj  (v)  :  j  >  1}  are  sequences  of 
pairwise  PQD  random  variables  with  finite  variances,  and 


(a)  EjliJ  2Cov  (Sj(v),  Wj(v))  <  oo ; 

(b)  E~i.T2Cov(C>),D>))<oo; 

(c)  supj>x  E \Sj(v)  -  E[S»]|  <  oo; 

(d)  supj>!  E  \Cj(v)  —  E[Cj(u)]|  <  oo. 


In  practice,  the  above  assumption  can  be  tested  as  follows:  PQD  dependence 
can  be  tested  using  the  statistical  methods  described  in  [Denuit  and  Scaillet  04] . 
To  test  for  finite  variance,  we  may  use  the  methods  described  in  [Resnick  07]  for 
heavy-tailed  data.  If  the  data  are  not  extremely  heavy-tailed,  we  may  assume 
that  the  data  have  finite  variance. 
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Lemma  5.3.  [Birkel  89,  Theorem  1]  Let  {Sj(v)  :  j  >  1}  be  a  sequence  of  pairwise 
PQD  random  variables  with  finite  variance.  Assume 

(a)  J~2Cov  (Sj(v),Wj(v))  <  oo. 

(b)  sup^  E \Sj(v)  -  E[5j(w)]|  <  oo. 

Then  as  b  — >  oo,  we  have  b~l(Wb(v)  —  E[Wf,(v)])  — >  0  almost  surely. 

Theorem  5.4.  Under  Assumption  5.2  and  given  a  sufficiently  large  number  b  of 
observations  on  the  states  of  node  v,  the  probability  q(v),  namely  the  probability 
that  node  v  is  compromised,  if  it  exists  (i.e.,  the  alternating  renewal  process  is 
steady),  can  be  expressed  as 

/  x  =  d*{v) 

^  w*(v)  +  d*(v)’ 

where  w*(v)  =  lim^,*,  E[Wj(u)]/&  and  d*{v)  =  lim^^  F,[Db(v)]/b. 


Proof.  According  to  Lemma  5.3,  conditions  (a)  and  (c)  imply 

Wb(v)  E [Wb(v)\ 

lim  — - —  =  Inn  - - - =  w  (v),  a.s 

b— >oo  0  b—>oo  0 

and  similarly,  (b)  and  (d)  imply 

Db{v)  E [Db(v)}  ^  , 

lim  — - —  =  lim  - - - =  a  (v),  a.s. 

b—>  oo  0  b—>oo  0 


According  to  Lemma  5.1, 


P(v)  = 


w*(v)  +  d*(v) ' 


The  limiting  probability  is 


p(v)  =  lim  pv(t), 

t—>oo 


which  means  that  there  exists  a  sufficiently  large  tg  such  that  for  every  e  >  0, 


\p(v)  ~Pv(t)\  <  e,  t>t0. 


Now  for  t  >  tg,  we  have 


Hence, 


1  U  1  1  f* 

-  pv  (s)  ds  <  -  Pv(s)  ds  +  -  /  pv(s)  ds. 
t  Jo  1  Jo  t  Jt0 

1  f1 

p(v)  =  lim  -  /  pv  ( s )  ds  <  p(v)  +  e. 

0°  t  Jo 
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Similarly, 


p{v)  >  p(v)  —  e. 


So  we  conclude  that 


and 


□ 


In  what  follows,  we  present  two  variants  of  Theorem  5.4.  In  the  first  variant, 
we  use  the  stronger  PA  assumption  to  replace  the  PQD  assumption  in  Theorem 
5.4  (or  more  precisely,  in  the  underlying  Assumption  5.2),  which  yields  that 
conditions  (c)  and  (d)  can  be  dropped.  To  see  this,  we  need  the  following  lemma. 


Lemma  5.S.  [Birkel  89,  Theorem  2]  Let  {Sj  (v)  :  j  >  1}  be  a  sequence  of  pairwise 
PA  random  variables  with  finite  variance.  Assume  that 

OO 

^j-2Cov(Sy(u),Wj(t;))  <  oo. 
i=  i 

Then  as  b  — >  oo,  we  have  6_1(Wf,(n)  —  E[Wf,(v;)])  — >  0  almost  surely. 


Assumption  5.6.  Assume  that  {Sj(v)  :  j  >  1}  and  { Cj(v )  :  j  >  1}  are  sequences  of 
pairwise  PA  random  variables  with  finite  variances,  and  that 


(a)  Ejti  J  2Cov  Wj(v))  <  oo; 

(b)  Er=ir2Cov(C',(n),Z)J(n))<oo. 


Theorem  5.7.  (One  variant  of  Theorem  5.4.)  Under  Assumption  5.6  and  given  a  sufficiently 
large  number  b  of  observations  on  node  v’s  states,  the  probability  q(v),  if  it  exists 
(i.e.,  the  alternating  renewal  process  is  steady),  can  be  expressed  as 


q(v) 


d*{v) 

w*(v)  +  d*(v) ' 


Proof.  The  proof  follows  from  Lemma  5.5  and  the  proof  of  Theorem  5.4.  □ 

In  the  following  we  present  another  useful  variant  of  Theorem  5.4  under  an¬ 
other  stronger  assumption. 
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Assumption  5.8.  Assume  that  {Sj  (v)  :  j  >  1}  and  { Cj(v )  :  j  >  1}  are  independent 
sequences  with  finite  variances. 

In  practice,  we  can  test  the  independence  for  sequences  {Sj  :  j  >  1}  and  {Cj  : 
j  >  1}  using  the  methods  described  in  [Hollander  and  Douglas  99]. 

Theorem  5.9.  (Another  variant  of  Theorem  5.4.)  Under  Assumption  5.8  and  given  a  suffi¬ 
ciently  large  number  b  of  observations  on  node  v’s  states,  the  probability  q(v),  if 
it  exists  (i.e.,  the  alternating  renewal  process  is  steady),  can  be  expressed  as 


w*(v)  +  d*(v) ' 


Proof.  The  theorem  follows  from  Kolmogorov’s  strong  law  of  large  numbers  and 
the  proof  of  Theorem  5.4.  □ 

The  following  corollary  expresses  the  connection  between  Theorem  5.9,  which 
corresponds  to  the  case  in  which  the  higher-level  alternating  renewal  process  is 
observed,  and  Theorem  2.2,  which  corresponds  to  the  case  in  which  the  lower- 
level  attack-defense  process  is  observed. 


Corollary  5.10.  Theorem  2.2  is  a  special  case  of  Theorem  5.9. 
Proof.  Under  Assumption  2.T  used  in  Theorem  2.2,  we  have 


Var(S(n))  <  2E 
Var((7(n))  = 


1 


a  +  K  (zj)  •  7 
2 

<  oo. 


,  1  . 

<  2  (  —  )  <oo, 
a , 


Hence,  the  conditions  in  Theorem  5.9  are  fulfilled.  Therefore,  we  have 


w*(v)  =  E[S(v)]  =E 


1 

a  +  jK(v)  J  ’ 


cf  (u)  =  E[C(u)] 


I 

P  +  v' 


As  a  result, 


q(v)  =  E 


P  +  V  1 
a  +  7 K  (v)  J 


which  is  the  same  as  in  Theorem  2.2. 


□ 
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5.2.  Applying  the  Analytical  Results  to  Estimating  q  in  Practice 


Theorems  5.4,  5.7,  and  5.9  require  observations  only  on  {(Si(v),  Ci(v));  l  >  1}. 
In  order  to  show  how  the  above  analytical  results  can  guide  the  computation  of 
q  in  practice,  we  use  Theorem  5.9  as  an  example.  Suppose  we  select  in  nodes, 
denoted  by  V\ , . . . ,  vm ,  uniformly  at  random  from  V  to  observe  their  respective 
{(Si(v),Ci(v))\l  >  l}’s.  For  each  vt,  1  <  i  <  m,  we  estimate  g(ty)  as  follows: 


1.  Observe  the  node  for  a  sufficiently  long  time,1  and  get  jt  observations 
(S'i(uI),C'i(uJ)), . . . ,  (Sji  (■ Vi ),Cji  ( Vi )). 

2.  Test  whether  { Si(vt )  :  ji  >  l  >  1}  and  {Q(vi);ji  >  l  >  1}  are  indepen¬ 
dent  sequences  (e.g.,  using  methods  described  in  [Hollander  and  Dou¬ 
glas  99]).  If  they  are  independent,  the  algorithm  continues  executing  the 
next  steps;  otherwise,  the  algorithm  aborts. 

3.  Test  whether  (S)(uj)  :  j  >  l  >  1}  and  {Ci(vi)\j  >  l  >  1}  have  finite  vari¬ 
ances  (e.g.,  using  the  methods  described  in  [Resnick  07]).  If  so,  the  algo¬ 
rithm  continues  executing  the  next  step;  otherwise,  the  algorithm  aborts. 


4.  Compute  W,  j 

5.  Compute 


Si(vi)  H - 1-  Sj(v i)  and  Ay 


q(vi) 


D 


Wi^j  +  Dij 


Ci  (vi )  +  •  •  •  +  Cj(vi). 


(5.1) 


Having  obtained  q{v i), . . . ,  q(vm),  we  can  compute 

m 

which,  while  intuitive,  is  not  unconditionally  applicable,  as  our  algorithm  demon¬ 
strates. 


6.  Related  Work 

As  extensively  discussed  in  [Li  et  al.  11],  there  have  been  some  (promising) 
attempts  at  quantitative  security  analysis  of  networked  systems.  In  what  follows 


1In  practice,  we  can  first  observe  the  time  interval  [0,ti]  and  use  (5.1)  to  compute  q(vi). 
Then  we  can  observe  a  longer  interval,  say  [0,^2]?  and  compute  q(vi)  again.  Assume  that  we 
collect  many  points  (q{vi),tz),  z  —  1,2, ... .  Then  we  may  plot  them  to  look  for  a  horizontal 
line.  If  a  horizontal  line  pattern  appears  after  tz  for  some  £,  we  may  say  that  tz  is  a  sufficiently 
long  time.  This  method  can  also  be  used  to  determine  whether  the  process  is  steady  in  practice. 
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we  discuss  three  main  approaches,  which  are  complementary  to  the  “stochastic 
process”  approach  of  [Li  et  al.  11]  and  the  present  paper. 

An  extensively  studied  approach  is  the  “attack-graph”  approach,  which  was 
initiated  in  [Phillips  and  Swiler  98].  Basically,  an  attack  graph  is  derived  from  a 
networked  system  and  its  known  vulnerabilities.  The  nodes  in  an  attack  graph 
are  the  states  of  the  networked  system,  and  the  edges  reflect  the  attack  steps  (i.e. , 
certain  computers  having  been  compromised  can  cause  the  compromise  of  more 
computers).  Along  this  line,  substantial  progress  has  been  made  in  the  last  decade 
[Jha  and  Wing  01,  Sheyner  et  al.  02,  Ammann  et  al.  02,  Noel  et  al.  03,  Ingols 
et  al.  06,  Ou  et  al.  06,  Wang  et  al.  07,  Sawilla  and  Ou  08,  Wang  et  al.  08a,  Wang 
et  al.  08b,  Xie  et  al.  10,  Huang  et  al.  11].  A  particular  application  of  attack  graphs 
is  to  (optimally)  harden  specific  assets  by  identifying  the  relevant  attack  paths. 
Our  approach  is  complementary  to  the  attack-graph  approach  because  of  the 
following.  First,  the  attack-graph  approach  mainly  focuses  on  studying  known 
(but  unpatched)  vulnerabilities.  The  only  exception  we  are  aware  of  is  the  recent 
extension  to  considering  the  number  of  unknown  vulnerabilities  that  are  needed 
in  order  to  compromise  some  specific  assets  [Wang  et  al.  10].  In  contrast,  our 
approach  aims  to  embed  unknown  vulnerabilities  into  the  model  inputs  (rather 
than  as  an  outcome  of  the  model).  Second,  the  attack-graph  approach  is  algo¬ 
rithmic  or  combinatorial  in  nature  (e.g.,  computing  the  number  of  attack  paths 
or  hardening  security  of  a  given  set  of  assets  with  minimal  effort).  In  contrast, 
our  approach  aims  to  model  the  evolution  of  node  states,  which  could  allow  us 
to  derive  basic  laws  or  principles  for  the  security  of  networked  systems. 

An  approach  that  is  closely  related  to  ours  is  what  we  call  the  “dynami¬ 
cal  system”  approach,  in  which  dynamical  system  models  play  an  essential  role. 
This  approach  was  rooted  in  epidemic  models  [McKendrick  26,  Kermack  and  Mc- 
Kendrick  27],  which  were  first  introduced  to  computer  security  for  studying  com¬ 
puter  viruses  in  [Kephart  and  White  91].  This  approach  has  been  coupled  with 
complex  network  structures  since  [Wang  et  al.  03,  Ganesh  et  al.  05,  Chakrabarti 
et  al.  08].  The  state-of-the-art  result  in  this  line  of  research  is  [Xu  et  al.  12],  which 
also  presented  perhaps  the  first  numerical  result  for  estimating  the  global  state 
that  is  comparable  to  our  concept  of  q.  While  the  results  in  [Xu  et  al.  12]  can 
accommodate  arbitrary  network  structures,  they  are  based  on  an  independence 
assumption  that  is  comparable  to  that  the  X^j’s  in  our  model  are  independent 
of  each  other.  As  shown  above,  we  aim  to  get  rid  of  this  independence  and  other 
assumptions  as  much  as  we  can. 

The  third  approach  is  what  we  call  the  “statistical  physics”  approach  (cf.  [Al¬ 
bert  and  Barabasi  02]  for  a  large  body  of  literature  and  its  numerous  follow-ons). 
This  approach  mainly  aims  to  characterize  the  robustness  of  network  connectiv¬ 
ity  (i.e.,  robustness  of  giant  components  in  the  presence  of  node  and/or  edge 
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deletions).  While  relevant,  there  is  a  fundamental  difference  between  network 
connectivity  and  security.  Specifically,  as  noted  in  [Li  et  al.  11],  a  malicious  at¬ 
tacker  may  aim  to  compromise  as  many  computers  as  possible.  However,  the 
ultimate  goal  of  the  attacker  may  be  to  steal  sensitive  information  rather  than 
to  disrupt  the  connectivity  of  networks.  In  other  words,  an  attacker  may  com¬ 
promise  sensitive  information  while  likely  not  disrupting  the  communications  of 
legitimate  users. 


7.  Discussion  and  Research  Directions 

In  this  section  we  discuss  the  utility  of  the  stochastic  process  approach,  and 
suggest  future  research  directions. 

7.1.  On  the  Utility  of  the  Stochastic  Process  Approach 

The  stochastic  process  approach,  introduced  in  [Li  et  al.  11],  can  be  seen  as  a 
new  way  of  thinking  about  quantitative  security.  The  present  paper  is  a  first  step 
toward  eliminating  exponential-distribution  assumptions  and  the  independence 
assumptions  in  [Li  et  al.  11].  The  approach  is  centered  on  stochastic  processes 
that  model  the  interaction  between  attack  and  defense.  The  approach  is  in  its 
infant  stage  and  is  currently  based  on  first-principle  modeling  (due  to  the  lack  of 
real-life  data).  Nevertheless,  the  approach  has  important  real-life  implications, 
as  we  elaborate  below. 

First,  since  stochastic  processes  capture  the  evolution  of  node  states,  the  re¬ 
sults  are  pertinent  to  the  evolution  of  system  states.  This  will  help  deepen  our 
understanding  of  the  problem  and  likely  will  allow  us  to  draw  insights  or  laws 
and  principles  that  govern  the  outcome  of  the  interaction  between  attack  and 
defense.  Note  that  the  characterization  is  not  fundamentally  based  on  knowledge 
of  the  parameters,  because  the  parameters  exist  in  any  case.  In  other  words,  the 
resulting  insights  or  laws  and  principles  are  valid  whether  we  know  the  specific 
values  of  the  model  parameters  or  not.  Such  general  (or  even  universal)  laws  and 
principles  are  of  paramount  importance. 

Second,  the  analytical  results  could  lead  to  guidelines  (as  shown  in  [Li  et 
al.  11])  for  adjusting  the  defense  in  a  cost-effective,  if  not  optimal,  fashion.  As 
mentioned  above,  the  upper  bounds  can  be  utilized  in  practice  to  capture  the 
worst-case  scenario,  and  can  be  adopted  for  decision-making.  For  example,  if  we 
know  that  q  is  bounded  from  above,  then  some  appropriate  proactive  threshold 
cryptosystems  [Herzberg  et  al.  97]  may  be  deployed  to  tolerate  the  bounded 
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compromise.  Moreover,  the  study  can  suggest  quantitative  defense  adjustments 
in  order  to  contain  q  below  a  desired  threshold  (as  shown  in  [Li  et  al.  11]). 

Third,  we  have  offered  here  statistical  methods  for  obtaining  security  measures 
in  practice.  In  Section  4,  we  presented  numerical  methods  for  deriving  the  global 
security  measure  q  in  various  practical  settings.  In  Section  5,  we  presented  sta¬ 
tistical  methods  for  obtaining  the  global  security  measure  q  when  only  high-level 
information  about  the  attack-defense  process  can  be  obtained. 

7.2.  Extending  the  Stochastic  Process  Models 

The  models  in  the  present  paper  can  be  extended  in  multiple  directions.  The  first 
direction  is  to  further  weaken  the  assumptions  as  much  as  possible.  One  question 
is  this:  how  should  we  accommodate  node  and  edge  heterogeneity  (other  than  the 
degree  or  topology  heterogeneity),  meaning  that  different  nodes  (edges)  exhibit 
different  parameter  distribution  characteristics?  Note  that  our  analytical  results 
on  q(v),  the  probability  that  node  v  is  compromised  in  the  steady  state,  already 
accommodate  node  and  edge  heterogeneity.  However,  our  analytical  results  on 
the  global  security  measure  g,  especially  its  bounds,  are  based  on  the  fact  that 
the  nodes  (edges)  exhibit  the  same  parameter  distribution  characteristics. 

Another  question  is  this:  how  can  we  bridge  the  gap  between  what  can  be 
observed  in  practice  (i.e.,  what  can  be  offered  by  real-life  data)  and  the  weakest 
possible  assumptions  that  we  have  to  make  in  order  to  derive  analytical  results? 
This  includes  the  design  of  new  statistical  methods  for  testing  assumptions  that 
currently  cannot  be  tested. 

The  second  direction  is  to  accommodate  multiple  vulnerability  classes  but  from 
a  different  perspective.  Our  models  are  based  on  aggregating  the  effects  of  vulner¬ 
abilities  at  individual  computers.  Another  perspective  is  to  consider  individual 
vulnerability  classes  that  are  “compatible”  with  each  other,  where  “compatibil¬ 
ity”  is  an  intuitive  concept  that  needs  to  be  formalized.  The  intuition  is  that  the 
exploitation  of  some  vulnerability  in  one  computer  can  cause  the  exploitation 
of  another  compatible  (i.e.,  not  necessarily  the  same)  vulnerability  in  another 
computer.  This  would  lead  to  multiple  stochastic  attack-defense  processes  that 
might  be  dependent  on  each  other.  This  quickly  becomes  very  complicated,  but 
certainly  worthy  of  study,  because  it  may  lead  to  deeper  insights.  Moreover,  it 
is  certainly  interesting  and  important  to  compare  the  two  perspectives. 

The  third  direction  is  to  accommodate  the  consequence  heterogeneity  between 
nodes.  This  is  important  because  the  damage  that  is  caused  by  the  compromise 
of  one  node  (e.g.,  a  server)  may  be  greater  than  the  damage  caused  by  the 
compromise  of  another  (e.g.,  a  desktop).  Our  models  can  partially  accommodate 
this  consequence  heterogeneity  via  the  q{v) ’s,  where  q(v )  is  the  probability  that 
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node  v  is  compromised  in  the  steady  state.  However,  our  results  on,  including 
the  bounds  of,  the  global  security  measure  q  do  not  consider  the  asset  on  node  v, 
which  may  be  denoted  by  asset  (v).  One  simple  extension  would  be  to  introduce 
a  derivative  security  measure  q(v)  x  asset(u),  which  can  lead  to  the  naturally 
extended  global  security  measure  J2veyq(v)  x  asset (u)/|V|.  Nevertheless,  other 
extensions  are  possible. 

The  fourth  direction  is  to  characterize  the  distribution  of  the  random  variable 
that  indicates  the  number  of  compromised  nodes  in  the  steady  state.  Note  that 
q  x  |V|  corresponds  to  the  expected  number  of  compromised  nodes  in  the  steady 
state.  Knowledge  about  the  distribution  of  the  random  variable  will  allow  us  to 
conduct  better  decision-making. 

7.3.  Unifying  the  Approaches 

As  mentioned  in  the  introduction,  the  problem  of  quantitative  security  anal¬ 
ysis  of  networked  computer  systems  has  been  outstanding  for  decades.  In  the 
above  section  on  related  work,  we  highlighted  three  main  approaches.  Both  the 
attack-graph  approach  and  the  statistical-physics  approach  have  been  extensively 
studied,  but  both  the  dynamical  system  approach  and  the  stochastic  process  ap¬ 
proach  are  in  their  infant  stages.  Nevertheless,  it  is  imperative  to  unify  these 
approaches  into  a  comprehensive  and  systematic  framework. 


8.  Conclusion 

We  extended  the  quantitative  security  model  and  analysis  presented  in 
[Li  et  al.  11]  by  substantially  weakening  its  assumptions  regarding  the  distribu¬ 
tions  of,  and  the  dependence  between,  the  relevant  random  variables.  In  partic¬ 
ular,  our  extensions  lead  to  practical  methods  for  obtaining  the  desired  security 
measures  for  both  the  case  that  the  lower-level  attack-defense  process  is  ob¬ 
served  and  the  case  that  the  higher-level  alternating  renewal  process  is  observed. 
We  discussed  future  research  directions  toward  ultimately  solving  the  problem 
of  quantitative  security  analysis  of  networked  systems. 
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Site  percolation  has  been  used  to  help  understand  analytically  the  robustness  of  complex  networks  in  the 
presence  of  random  node  deletion  (or  failure).  In  this  paper  we  move  a  further  step  beyond  random  node  deletion 
by  considering  that  a  node  can  be  deleted  because  it  is  chosen  or  because  it  is  within  some  L-hop  distance 
of  a  chosen  node.  Using  the  generating  functions  approach,  we  present  analytic  results  on  the  percolation 
threshold  as  well  as  the  mean  size,  and  size  distribution,  of  nongiant  components  of  complex  networks  under 
such  operations.  The  introduction  of  parameter  L  is  both  conceptually  interesting  because  it  accommodates  a 
sort  of  nonindependent  node  deletion,  which  is  often  difficult  to  tackle  analytically,  and  practically  interesting 
because  it  offers  useful  insights  for  cybersecurity  (such  as  botnet  defense). 
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I.  INTRODUCTION 

Understanding  the  robustness  (or  connectivity)  of  complex 
networks  in  the  presence  of  node  deletion  (or  failure)  is  an 
important  problem  that  has  attracted  much  attention.  Existing 
analytic  studies  mainly  focused  on  the  case  of  randomly 
deleting  nodes  in  random  graphs  with  a  given  arbitrary  degree 
distribution  (the  so-called  configuration  model  [1-5]).  By 
treating  node  deletion  as  site  percolation  on  large  networks 
(i.e.,  in  the  limit  of  network  size),  Newman  et  al.  [3] 
introduced  a  novel  generating  functions  approach  to  tackling 
the  problem.  This  approach  has  been  used  in  various  settings 
(e.g..  Refs.  [6-8]). 

In  this  paper,  we  move  a  further  step  beyond  random 
node  deletion  by  considering  the  following  node  deletion 
operation:  a  node  is  deleted  either  because  it  is  chosen  or 
because  it  is  within  a  chosen  node’s  L-hop  distance  in  terms 
of  shortest  path  between  them  in  unweighted  graphs,  where 
L  ^  0.  Figure  1  illustrates  an  example  scenario  with  L  =  1, 
where,  if  v  is  chosen  to  be  deleted,  then  both  v  and  its  one-hop 
(i.e.,  direct)  neighbors  mi,W2,«3,M4  are  deleted.  We  call  the 
new  model  “L-hop  percolation”  because  the  random  node 
deletion  extensively  studied  in  the  literature  [3-8]  can  be  seen 
as  the  special  case  with  L  =  0.  The  introduction  of  the  new 
parameter  L  is  of  both  conceptual  and  practical  value,  as  we 
discuss  below. 

From  a  conceptual  perspective,  the  introduction  of  L  allows 
us  to  model  and  characterize  a  class  of  nonindependent  node 
deletion  operations.  Previous  analytical  studies  (including 
Refs.  [3-8])  in  this  context  focused  on  independent  node  dele¬ 
tions.  Nonindependent  node  deletions  are  often  challenging 
to  treat  analytically,  but  we  manage  to  present  some  analyt¬ 
ical  results  in  the  specific  nonindependence  setting  of  this 
paper. 

From  a  practical  perspective,  the  nonindependent  node 
deletion  has  real-life  applications.  On  the  one  hand,  an 
adversary  could  certainly  destroy  both  randomly  chosen  nodes 
and  their  neighbors  within  L-hop  distance.  On  the  other  hand, 
analyzing  such  nonindependent  node  deletions  can  lead  to 
useful  insights  for  defending  cyberspace.  To  see  this,  consider 
an  example  of  botnet  defense.  A  botnet  is  a  network  of 
computers  (called  hots  or  zombies)  that  are  compromised 


by  a  computer  malware/virus  and  under  the  control  of  a 
bot-master  (the  human  attacker).  Peer-to-peer  (P2P)  botnets 
[9]  are  arguably  the  most  powerful  cyberattacks  that  are 
very  difficult  to  defend  against.  This  is  because  a  hot  only 
“knows”  (i.e.,  can  directly  communicate  with)  a  few  others, 
which  suffices  their  needs  of  forwarding  the  bot-master’s 
command-and-control  messages.  In  real  life,  it  is  very  difficult 
to  obtain  the  precise  structures  of  P2P  botnets  because  of 
various  technical  issues  and  ethics/privacy  concerns  [10,11]. 
This  means  that  the  defender  cannot  expect  to  identify  all 
the  hots  belonging  to  a  botnet  and  then  eliminate  them  as 
a  whole.  Instead,  it  is  more  realistic  that  the  defender,  first, 
detects  some  hots  and  then  traces  the  hots  that  communicated 
with  the  detected  ones  (we  call  this  method  “detect  and  then 
trace”).  However,  the  defender’s  capability  in  tracing  hots 
are  limited  by  various  reasons.  For  example,  botmasters  have 
abused  cryptographic  techniques  for  making  P2P  botnets — 
such  as  Conficker,  Nugache,  and  Storm  and  its  successor, 
Waledac — more  stealthy  [9,12-15].  Moreover,  the  Waledac 
botnets  actually  do  not  spread  (propagate)  using  a  scanning 
technique  that  is  relatively  easy  to  defend  against  but  rather 
spread  (propagate)  via  social-engineering  methods  that  are 
much  more  difficult  to  deal  with  [16],  Because  P2P  botnets 
are  naturally  complex  networks,  the  defender’s  limited  tracing 
capability  in  combating  P2P  botnets  naturally  motivates  the 
concept  of  L-hop  percolation.  Note  that  L  —  0  corresponds  to 
the  fact  that  the  defender  can  only  directly  detect  (and  thus 
delete)  some  hots,  and  L  ^  0  corresponds  to  the  fact  that  the 
defender  not  only  can  directly  detect  (and  thus  delete)  some 
hots  but  also  can  trace  to  (and  thus  delete)  the  hots  within  their 
L-hop  neighborhood. 

The  present  paper  aims  to  understand  and  characterize  the 
robustness  of  complex  networks  with  L  >  0  when  compared 
to  the  case  with  L  —  0.  Specifically,  we  consider  L-hop  site 
percolation  on  random  graphs  with  arbitrary  node  degree 
distributions  (in  the  configuration  model).  Using  the  gener¬ 
ating  functions  approach  [3,6],  we  present  analytical  results 
on  the  percolation  threshold,  and  mean  size  as  well  as  size 
distribution  of  the  nongiant  components.  The  analytical  results 
are  confirmed  by  our  simulation  study  on  power-law  random 
graphs  and  Erdos-Renyi  random  graphs. 
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FIG.  1 .  An  example  scenario:  when  L  =  0,  a  randomly  chosen 
node  v  is  deleted  (this  is  the  case  extensively  studied  in  the  literature); 
when  L  —  1,  a  randomly  chosen  node  v  and  its  one-hop  neighbors 
ui,ii2,Ui,U4  are  deleted.  As  defined  in  the  text,  Af^o(v)  =  [W  and 

A/^l(u)  =  {v,Ui,U2,U3,U4}. 


II.  SYSTEM  MODEL  AND  ANALYTIC  RESULTS 

Let  G  =  (V.E)  be  a  complex  network  (or  graph),  where 
i>  e  V  is  the  set  of  nodes  (vertices)  and  E  is  the  set  of  edges  (we 
focus  on  undirected  unweighted  graphs).  Let  distanced ,  v)  be 
the  length  of  the  shortest  path  between  u  e  V  and  v  e  V  in  G, 
where  distanced, u)  =  0.  Define  the  (’-hop  neighborhood  of 
node  v  as  AT^e(v)  =  {u  :  u  e  V, distanced, u)  ^  £},  namely 
the  set  of  nodes  that  are  no  more  than  £ -hop  away  from  v.  The 
following  parameters  are  used  in  our  model: 

(1)  L:  This  is  upper  bound  on  the  number  of  hops  the 
deletion  operation  can  reach  out  to  when  starting  at  a  randomly 
chosen  node.  Note  that  L  indicates,  in  a  sense,  the  degree  of 
nonindependence  of  the  nodes  that  are  deleted  (intuitively, 
the  larger  the  L,  the  stronger  the  nonindependence)  and  can 
represent  (for  instance)  the  defender’s  capability  in  tracing 
hots  in  the  above  example  of  botnet  defense. 

(2)  (ro,n, . . .  ,rL)  with  £f=0  ry  =  1:  For  a  node  v  and  its 
L-hop  neighborhood  A/£(iO,  the  simplest  scenarios  is  to 
delete  all  the  nodes  belonging  to  A/£( v),  which  is  a  special 
case  we  mentioned  above.  To  make  our  model  more  generally 
applicable  while  reflecting  the  fact  mentioned  above — the  farer 
from  v  the  neighbors,  the  less  likely  the  defender  can  trace  to 
them — we  need  a  representation  of  detection  probability  that 
decreases  as  the  distance  from  a  chosen  node  grows.  For  this, 
we  use  vector  (r0,rt , . . .  ,rL)  to  represent  a  class  of  deletion 
“strategies”  such  that  a  node  u,  which  is  distance(M,i>)-hop 
away  from  a  chosen  node  v,  will  be  deleted  with  probability 
£t=distance(u  v)  r<  ■  Because  the  chosen  node  v  will  be  deleted 
with  probability  ££_0  ry,  we  naturally  require  £^=0/7  =  1, 
meaning  that  v  will  always  be  deleted.  This  explains  why 
the  extensively  studied  case  with  L  —  0  is  a  special  case  of 
the  model  with  L  ^  0  in  this  paper.  In  order  to  see  that  the 
representation  corresponds  to  a  class  of  deletion  operations,  we 
note  that  the  scenario  of  L  =  0  shown  in  Fig.  1  corresponds  to 
strategy  (ro)  =  (1),  and  the  scenario  of  L  —  I  corresponds  to 
strategy  (ro,n)  =  (0,1).  Continuing  on  the  example  of  L  =  1 
but  with  a  different  strategy  (0  <  ro  <  1,0  <  r\  <  1)  where 
ro  +  r\  =  1,  we  observe  that  node  v  will  be  deleted  with 
probability  ro  +  r i  =  1  but  every  node  belonging  to  AT  |  (v)  — 
A/”^o(v)  =  {iti,U2,U3,U4j  will  be  deleted  with  probability  n. 

(3)  pk-  This  is  the  probability  that  a  randomly  chosen  node 
has  degree  k. 

(4)  q,  q qc:  q  is  the  probability  an  appropriately  chosen 
node  is  occupied  (i.e.,  not  deleted),  and  1  —  q  is  the  probability 
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that  node  is  unoccupied  (i.e.,  deleted).  In  general  0  ^  q  ^  1 
because,  for  example,  the  curing  of  a  bot  may  not  be  perfect 
[17],  Correspondingly,  qk  is  the  probability  that  a  node  of 
degree  k  is  occupied,  and  qc  is  the  percolation  threshold  we 
want  to  derive,  above  which  a  giant  component  emerges. 

Suppose  x  ^  1.  Following  the  generating  functions  ap¬ 
proach  [3],  Go(*)  =  TZo  Pkxk  and  G,(x)  =  G'0(x)/G'0(  1) 
are  the  generating  functions  of  nodes’  degree  distribution 
and  nodes’  excess-degree  distribution,  respectively.  For  any 
node  v  e  V,  the  generating  function  for  the  size  distribution 
of  Af&fr)  -  is  Gmto  :=  Go[Gi(-  ■  •  Gi(x)  ■■■)]> 

with  £  —  1  iterations  of  the  function  G\  acting  on  itself. 
For  a  given  L ,  for  each  0  ^  ^  L  we  can  define  a  “base” 

strategy  baseStrategyf  =  (r0, . . .  ,r L),  where  r;  =  1  if  j  —  £ 
and  r j  —  0  otherwise.  (This  is  for  deriving  some  intermediate 

(£) 

formulas  that  will  be  used  to  derive  the  final  result.)  Let  qk 
be  the  probability  that  a  node  of  degree  k  is  occupied  (i.e., 
not  deleted)  with  respect  to  baseStrategyf .  Note  that  the 
component  structure  in  graph  G  is  essentially  treelike  because 
the  probability  that  a  component  contains  a  closed  loop  goes  as 
/V  ~  and  is  therefore  negligible  in  the  limit  of  large  N.  Then, 
for  a  node  v  of  degree  k,  we  have  qj,0>  =  q,  qk1]  =  qk+l ,  and 
for  £  ^  2, 


(£) 

qk 


$1=0  Y£_1=0 


l- 1 


£•••  e  k+i+E-un 


7=1 


dSj  G(j+l\x) 
Sj !  dxsi 


,  (1) 


where  Sj  is  the  number  of  u’s  distance-) j  +  1)  neighbors 
[i.e.,  Sj  is  the  size  of  the  set  {u  :  distanced, u)  =  j  + 
1}],  d '  G,  j+Jj(X)  lx=o  is  the  probability  that  v  has  exactly  Sj 
distance-)^  +  1)  neighbors  (based  on  the  property  of  deriva¬ 
tives  of  generating  functions  for  degree  distributions  [3,18]), 
n£l  ~  jCi  >dx°j'X'>  U=o  is  the  probability  that  v  has  respec¬ 
tively  Sj  distanc e-(y  +  1)  neighbors  for  j  =  —  1,  and 

qk+'+Y.'i=l  ■'•■j  n^A1]  d  U=o  is  the  probability  that  v  is 

occupied  (i.e.,  not  deleted)  with  respect  to  baseStrategy,,  for 
a  specific  £. 

Given  qk  for  £  =  0.1,...  and  strategy  (r\, . . .  ,r^),  we 
obtain 

dk  =  Y^qf)rt.  (2) 

1=0 


The  probability  that  any  node  of  degree  k  is  occupied  is 
given  by  pip/k,  and 

OO 

Fo(x)  =  ^2  PkQkXk  (3) 

k= 0 


is  the  probability  generating  function  for  this  distribution  [18]. 
Since  Fo(l)  is  the  probability  that  a  randomly  chosen  node  is 
occupied,  it  can  be  viewed  as  the  portion  of  occupied  nodes. 

As  in  Refs.  [3,6],  if  we  start  at  a  randomly  chosen  node  and 
follow  each  of  its  edges  to  reach  its  neighbors,  the  distribution 
of  the  other  edges  of  each  node  arrived  and  occupied  is 
generated  by 


Fi(x)  = 


IXo  kPkVkXk  1 

Etokpk 


Fp(x) 

z 


(4) 


where  z  =  (k)  =  Gq(1)  is  the  average  node  degree. 
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Let  H\(x)  be  the  generating  function  for  the  distribution  of 
the  sizes  of  percolation  components  with  occupied  nodes  that 
are  reached  by  choosing  a  random  edge  and  following  it  to  one 
of  its  ends.  With  the  same  sort  of  reasoning  as  in  Ref.  [7],  we 
derive  that  H\{x)  satisfies  the  self-consistency  condition 

Hl(x)=l-Fl(l)  +  xF1[Hl(x)].  (5) 


e~xXk/k\  for  k  ^  0.  From  Eq.  (8),  we  see  that  the  percolation 
threshold  qc  is  determined  by 


where  qk  is  given  by  (2)  and  (1). 


Similarly,  the  distribution  for  the  size  of  the  component  to 
which  a  randomly  chosen  node  belongs  is  generated  by 

H0(x)  =  1  -  F0(l)  +  xF0[Hdx)l  (6) 

which,  in  conjunction  with  Eqs.  (l)-(5),  can  be  used  to 
determine  some  quantities  of  interest  such  as  the  /.-hop 
percolation  threshold,  the  mean  size,  and  size  distribution  of 
nongiant  components.  In  what  follows  we  present  the  details. 


B.  Mean  size  of  nongiant  components 

Because  Hq(x)  generates  the  size  distribution  of  nongiant 
components,  Hq(1)  takes  the  value  of  I  —  .S',  where  S  is  the 
fraction  of  nodes  belonging  to  some  giant  component  (equiv¬ 
alently,  S  is  the  fraction  of  giant  components).  Therefore, 

S  =  1  -  H0(l)  =  F0(l)  -  F0(w),  (13) 

where  u  =  H\(l)  is  the  smallest  non-negative  solution  of 


A.  L-hop  percolation  threshold 

From  Eqs.  (5)  and  (6),  we  obtain  that  in  the  absence  of 
giant  components,  the  mean  size  of  components  to  which  a 
randomly  chosen  node  belongs  is  given  by 


(s)  =  Hq{1)  =  Fo(l)  +  Fo(l)//j'(l) 

^o(D2 


=  F0(l)  + 


(7) 


The  above  expression  diverges  at  z  =  /’)"( 1 ),  which  cor¬ 
responds  to  the  percolation  threshold  qc  (i.e.,  the  critical 
occupation  probability)  where  a  giant  component  emerges. 
Hence,  qc  is  determined  by 


u=l-Fl(l)+Fl(u).  (14) 

In  general,  the  mean  size  of  the  component,  excluding  the 
infinite  giant  component,  to  which  a  randomly  chosen  vertex 
belongs  can  be  expressed  as 

_  ff'(l)  _  H0[Hl(l)]  + 

Ho(l)  Ho(l) 

=  Fq(M)[-  -  F"(l)]  +  F^U)1 

(1  -  S)[z  -  F"(l)]  ’  K  ’ 

which  is  equivalent  to  Eq.  (7)  when  there  is  no  giant  component 
(i.e.,  S  =  0,  u  =  1). 


^2 kpk  =  ^2  k(k  -  (g) 

k—0  k= 0 


where  qk  is  given  by  Eq.  (1).  In  what  follows  we  instantiate 
the  above  general  qc  in  the  special  cases  of  regular  graphs  and 
of  the  classical  Erdos-Renyi  random  graphs. 

In  the  case  of  cf-regular  graphs  G  on  n  nodes,  where  cl  is 
a  positive  integer,  we  have  the  degree  distribution  pd  =  1  and 
pk  =  0  for  k  ^  d.  From  Eqs.  (1)  and  (2)  we  get 


qd 


where  for  l  —  0, . . .  ,L, 


=  EA, 


(9) 


1=0 


=  £■■•£ 

si=0  sj-i=0 


(J+i+E':’,-,  jq  <'"GU+1>W 


■?;!  dxsi 
j= 1  J 


x=0 


(10) 


C.  Size  distribution  of  nongiant  components 

Let  7ts  be  the  probability  of  a  randomly  chosen  node 
belonging  to  a  component  of  size  s.  We  can  then  rewrite  the 
generating  function  Hod)  as 

OO 

Hod)  =  7T0  +  ^2  7TSXS.  (16) 

5=1 


It  is  straightforward  to  see  that  7Tq  =  1  —  Fo(l)  and  7Tj  = 
Ho(l)  —  ttq  —  X!^=27rs>  where  Ho(l)  can  be  determined  by 
Eqs.  (13)  and  (14).  For  .v  ^  2,  we  have  from  Eqs.  (16)  and  (6) 


7VS 


1 

(S  -  1)! 


ds~l 

’H0d)  -  7T0  ’ 

1 

dxs~l 

X 

J 

x=0 


j2w^iFo[Hi0cm^ 

d»Hld)]|^. 


(17) 


By  utilizing  Eq.  (8),  we  see  that  the  percolation  threshold  qc  By  using  the  Cauchy  formula  for  the  higher-order  derivative 
can  be  calculated  from  of  a  function  [20], 


\={d-\)qd.  (11) 

In  the  case  of  the  classical  Erdos-Renyi  random  graphs 
G(n,X/n)  for  some  X  >  0  [19],  the  average  degree  for  each 
vertex  is  X  and  the  degree  distribution  is  given  by  pk  = 


<n_  _  n\  r  f(z)  d 

dzn  z=zo  2 jti  J  (z-  Zo)'!+1 


(18) 


where  the  integral  is  around  a  contour  that  encloses  zo  in  the 
complex  plane  but  encloses  no  poles  in  fix)-  Equation  (17) 
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(a)  t  =  2  (b)  t  =  2.5 


FIG.  2.  (Color  online)  Percolation  threshold  qc  in  the  case  of  power-law  graphs.  Simulation  results  are  plotted  with  respect  to  L  =  0  with 
strategy  (r0)  =  (1)  (squares),  L  =  1  with  strategy  (ro.L)  =  (0,1)  (circles),  L  =  2  with  strategy  (r0, T| ,  r2)  =  (0,0, 1)  (triangles),  and  L  =  3  with 
strategy  (lo,  11,12,13)  =  (0,0,0, 1)  (crosses).  The  insets  show  a  magnified  view  of  the  curves  for  L  =  1,  L  =  2,  and  L  =  3.  In  the  simulation  we 
set  the  threshold  for  giant  component  as  1000  nodes.  The  plots  of  simulation  results  correspond  to  the  average  of  100  independent  simulation 
runs.  Exact  solutions  (solid  curves)  are  obtained  from  Eq.  (8). 


then  becomes 


1 

2ni(s  —  1) 

Z 

2ni{s  —  1) 


Fp[//l(x:)]  dH\(x) 

jr5-1  dx 


dx 


dHu 


(19) 

(20) 


where  the  integral  in  Eq.  (19)  is  around  an  infinitesimal  contour 
around  the  origin  in  the  complex  plane,  and  the  integral 
in  Eq.  (20)  is  an  infinitesimal  loop  around  1  —  Fi(l)  since 
H\(x)  — >■  1  —  Ei(l)  as  x  — >  0. 

By  treating  x  as  a  function  of  Hi,  we  plug  Eq.  (5)  into 
Eq.  (20)  to  obtain 


7TS 


[FdHi)Y 


2jt i (s  -  1)  J  [Hi  —  1  +  Fi(l)Y 


-dHi.  (21) 


By  applying  the  Cauchy  formula  to  Eq.  (21),  we  obtain  the 
desired  result 


Ttc  — 


(s  -  D! 


js—2 


dx 


s — 2 


[Fiixf] 


(22) 


x=l- Fi(l) 


III.  SIMULATION  RESULTS 

A.  Simulation-based  confirmation  of  the  percolation  thresholds 

In  order  to  verify  the  theoretical  results,  we  first  conducted 
simulations  on  random  graphs  at  the  order  1  000  000  nodes 
with  node  degrees  distributed  according  to  the  truncated  power 
law  p o  =  0  and 


Pk  =  Ck~'e~k/x ,  1,  (23) 

where  C,  t,  and  x  are  constants.  The  reason  for  choosing 
this  distribution  is  twofold.  On  one  hand,  this  distribution 
is  seen  in  various  real-life  networks  including  the  Internet 
[21],  collaboration  networks  of  movie  actors  and  scientists 


[22,23].  On  the  other  hand,  the  exponential  cutoff  makes  the 
distribution  normalizable  for  all  t  so  the  generating  functions 
and  their  derivatives  are  finite.  In  our  simulation  study  we 
set  0  ^  L  ^  3  and,  for  simplicity,  strategy  (ro, . . .  ,/y  )  = 
(0, . . .  ,0, 1).  We  consider  two  scenarios  of  t  =  2  and 
t  =  2.5. 

Figure  2  plots  the  percolation  threshold  qc  obtained  from 
our  simulations,  along  with  the  exact  solutions  obtained 
from  Eq.  (8).  We  make  the  following  observations.  First, 
the  simulation  results  agree  with  their  analytical  counterparts. 
Second,  there  is  a  huge  gap  between  the  percolation  thresholds 


FIG.  3.  (Color  online)  Percolation  threshold  in  the  case  of  Erdos- 
Renyi  graphs.  Simulation  results  are  plotted  with  respect  to  L  =  0 
with  strategy  (r0)  =  (1)  (squares),  L  =  1  with  strategy  (roJO  =  (0,1) 
(circles),  L  =  2  with  strategy  (r0,  T! ,  r2)  =  (0.0, 1)  (triangles),  and  L  = 
3  with  strategy  (ro,ri  J2J3)  =  (0,0,0, 1)  (crosses).  In  the  simulation 
we  set  the  threshold  for  giant  component  as  1000  nodes.  The  plots 
of  simulation  results  correspond  to  the  average  of  100  independent 
simulation  runs.  Exact  solutions  (solid  curves)  are  obtained  from 
Eq.  (12). 
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with  respect  to  L  —  0  and  L  =  1 .  For  example,  consider  the 
case  of  t  —  2  with  x  =  50.  When  L  =  0,  giant  component 
disappears  only  after  randomly  deleting  90%  nodes;  whereas, 
when  L  —  1,  giant  component  disappears  after  deleting 
12%  nodes,  including  both  the  randomly  chosen  nodes  and 
their  direct  neighbors.  This  means,  in  the  botnet  defense 
example,  that  “detect  and  then  trace”  could  be  much  more 
effective. 

Third,  there  is  no  big  difference  between  the  thresholds  cor¬ 
responding  to  L  =  1,2,3,  respectively.  This  phenomenon  can 
be  interpreted  as  follows:  In  networks  with  a  power-law  degree 
distribution,  most  nodes  have  small  degrees  and  only  a  small 


number  of  nodes  (called  hubs)  have  large  degrees.  Therefore, 
a  randomly  chosen  node  probably  has  a  neighbor  that  is  a  hub, 
meaning  that  when  L  =  1  with  strategy  (loTi)  =  (0, 1),  a  hub 
could  be  deleted  even  after  choosing  a  small  number  of  nodes. 
This  explains  the  abrupt  change  in  the  percolation  threshold  in 
the  case  of  L  —  1  with  strategy  (To ,  Ti )  =  (0,1)  when  compared 
with  the  percolation  threshold  in  the  case  of  L  —  0  with 
r0  =  1.  The  fact  that  L  —  1  already  allows  the  deletion  of 
some  hub  nodes  further  explains  why  increasing  L  further 
may  not  significantly  increase  the  percolation  threshold.  This 
means  that  tracing  one-hop  neighbors  is  almost  as  powerful 
as  tracing  multihop  neighbors.  We  mention  that  L  =  1  with 
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FIG.  4.  (Color  online)  Fraction  of  giant  components  S  and  mean  size  of  nongiant  components  (j)  in  power-law  random  graphs  with  t  =  2 
and  x  =  20.  Simulation  results  are  plotted  with  respect  to  (a)  L  =  0  with  strategy  (r0)  =  (l)(  squares),  (b)  L  =  1  with  strategy  (r0 ,  E )  =  (0, 1 ) 
(circles),  (c)  L  =  2  with  strategy  ( r0 ,  r ! ,  r2 )  =  (0.0, 1)  (triangles),  and  (d)  L  =  3  with  strategy  (r0,  Ti ,  r2,  r3)  =  (0,0,0, 1)  (crosses).  In  the  simulation, 
we  set  the  threshold  for  giant  component  as  1000  nodes.  The  plots  of  simulation  results  correspond  to  the  average  of  100  independent  simulation 
runs.  Exact  solutions  (solid  curves)  are  obtained  from  Eqs.  (13),  (14),  and  (15). 
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strategy  (loTi)  =  (0,1)  is  somewhat  related  to  the  method 
known  as  “acquaintance  immunization”  [24],  which  consists 
of  choosing  a  random  node  and  deleting  a  random  one-hop 
neighbor  of  the  chosen  node  (the  chosen  node  itself  is  not 
deleted).  It  was  shown  in  Ref.  [24]  that  the  effectiveness  of  this 
method  (especially  on  power-law  networks)  comes  from  the 
fact  that  with  high  probability  the  randomly  chosen  neighbor 
of  a  randomly  chosen  node  is  a  hub,  and  therefore  the  hubs  are 
knocked  out  after  a  small  number  of  such  operations.  In  this 
paper,  the  same  effect  causes  the  dramatic  increase  in  qc  when 
changing  L  —  0  to  L  =  1 . 


PHYSICAL  REVIEW  E  00,  001100  (2011) 

We  then  conducted  simulations  on  the  classic  Erdos-Renyi 
graphs  G(n,p)  with  n  —  1  000000  nodes  and  link  probability 
p  =  X/n.  Figure  3  plots  the  percolation  threshold  qc  obtained 
from  our  simulations,  along  with  the  exact  solutions  obtained 
from  Eq.  (12).  Similar  qualitative  behavior  can  be  observed  as 
in  the  case  of  power-law  graphs,  with  two  noticeable  differ¬ 
ences:  the  difference  between  the  thresholds  corresponding  to 
L  =  0  and  L  =  1,2,3  trails  off,  while  the  differences  between 
L  =  1,2,3  become  more  apparent.  These  phenomena  can 
be  explained  by  that  Erdos-Renyi  random  graphs  have  more 
homogeneous  degree  distributions  than  power-law  graphs. 


(ft 


(b)  L  =  1 

<st 


(c)  L  =  2  (d)  L  =  3 

FIG.  5.  (Color  online)  Fraction  of  giant  components  S  and  mean  size  of  nongiant  components  ( s )  in  power-law  random  graphs  with 
t  =  2.5  and  x  =  20.  Simulation  results  are  plotted  with  respect  to  (a)  L  =  0  with  strategy  (r0)  =  (1)  (squares),  (b)  L  =  1  with  strategy 
(r0,rj)  =  (0,1)  (circles),  (c)  L  =  2  with  strategy  (r0,rlJ2)  —  (0.0, 1)  (triangles),  and  (d)  L  =  3  with  strategy  (r0 ,  T|  ,r2,r3)  =  (0,0,0, 1)  (crosses). 
In  the  simulation,  we  set  the  threshold  for  giant  component  as  1000  nodes.  The  plots  of  simulation  results  correspond  to  the  average  of  100 
independent  simulation  runs.  Exact  solutions  (solid  curves)  are  obtained  from  Eqs.  (13),  (14),  and  (15). 
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FIG.  6.  (Color  online)  Fraction  of  giant  components  S  and  mean  size  of  nongiant  components  {s)  in  Erdos-Renyi  random  graphs  with  X  =  15. 
Simulation  results  are  plotted  with  respect  to  (a)  L  =  0  with  strategy  (r0)  =  (1)  (squares),  (b)  L  =  1  with  strategy  (r0,ri)  =  (0,1)  (circles), 
(c)  L  =  2  with  strategy  (OTi  J2)  =  (0,0,1)  (triangles),  and  (d)  L  =  3  with  strategy  (roTi.H.H)  =  (0,0,0, 1)  (crosses).  In  the  simulation  we  set 
the  threshold  for  giant  component  as  1000  nodes.  The  plots  of  simulation  results  correspond  to  the  average  of  100  independent  simulation  runs. 
Exact  solutions  (solid  curves)  are  obtained  from  Eqs.  (13),  (14),  and  (15). 


B.  Simulation-based  confirmation  of  the  fraction  S  of  giant 
components  and  the  mean  size  (s)  of  the  nongiant 
components 

For  this  purpose,  we  conducted  simulations  on  the  same 
power-law  random  graphs  with  distribution  (23)  and  the  same 
Erdos-Renyi  random  graphs  G(n,X/n)  with  n  =  1000000. 
Figure  4  and  Fig.  5  plot  the  cases  of  power-law  graphs  with 
t  =  2,  x  —  20  and  t  —  2.5,  x  =  20,  respectively.  Figure  6  plots 
the  case  of  Erdos-Renyi  random  graphs  with  1=15.  From 
them  we  draw  the  following  observations. 

First,  we  observe  that  when  the  occupation  probability  q  < 
qc,  the  fraction  of  giant  components  is  S  —  0.  Then,  S  increases 
with  q  when  q  Js  q,  for  all  four  strategies 


(r, 0, . . .  jl)  —  (0,  . . .  ,0, 1)  with  respect  to  L  —  0,  1, 2,  3.  More 
interestingly,  from  another  perspective,  we  observe  that  as 
the  number  of  chosen  nodes  increases,  the  fraction  of  giant 
components  decreases  abruptly  in  the  cases  of  L  >  1  but 
decreases  smoothly  in  the  case  of  L  —  0.  This  illustrates  the 
power  of  deletion  of  even  one-hop  neighbors.  Moreover,  the 
giant  component  disappears  faster  in  the  case  of  power-law 
graphs  than  in  the  case  of  Erdos-Renyi  random  graphs.  For 
example,  in  the  case  of  L  =  1,  the  giant  component  disappears 
after  deleting  about  10%  (20%)  of  nodes  and  their  one-hop 
neighbors  in  the  case  power-law  graphs  with  t  =  2.5  as  shown 
in  Fig.  5(b)  [in  the  case  of  Erdos-Renyi  random  graphs  as 
shown  in  Fig.  6(b)] .  This  can  be  explained,  as  mentioned  above. 
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(a)  Power-law  graphs  with  t  —  2,  x  —  20 


(b)  Power-law  graphs  with  t  =  2.5,  x  —  20 


FIG.  7.  (Color  online)  Size  distribution  for  nongiant  components  ns  in  power-law  random  graphs.  Simulation  results  are  plotted  with 
respect  to,  from  left  to  right,  L  =  0  with  strategy  (r0)  =  (1)  (squares),  L  =  1  with  strategy  (to.lY)  =  (0,1)  (circles),  L  =  2  with  strategy 
(to.r1.r2)  =  (0,0,1)  (triangles),  and  L  =  3  with  strategy  (rbTi^D)  =  (0,0,0, 1)  (crosses).  In  the  simulation  we  set  in  (a)  the  occupation 
probabilities  q  =  0.5,0.92,0.95,0.98  to  keep  the  fraction  of  giant  components  S  ^  0.1 1  in  all  four  cases;  we  set  in  (b)  q  =  0.65,0.95,0.96,0.98 
to  keep  the  fraction  of  giant  components  S  *=»  0.09  in  all  four  cases.  The  simulation  results  correspond  to  the  average  of  100  independent 
simulation  runs.  Exact  solutions  (solid  curves)  are  obtained  from  Eq.  (22). 


by  the  fact  that  in  the  case  of  power-law  graphs,  it  is  relatively 
easier  to  hit  the  hubs  because  they  are  the  neighbors  of  many 
nodes. 

Second,  when  q  approaches  the  critical  point  qc,  the  mean 
size  of  nongiant  components  (s)  increases  abruptly.  This 
agrees  with  our  theory.  In  fact,  when  q  —  qc  we  obtain  z  = 
F"(l)  in  Eq.  (21).  By  using  Eq.  (15),  we  obtain  \m\q^qc  ( s )  = 
00.  The  percolation  thresholds  qc  shown  in  Fig.  4,  Fig.  5, 
and  Fig.  6  agree  with  those  in  Fig.  2(a),  Fig.  2(b),  and 
Fig.  3.  By  comparing  Fig.  4  and  Fig.  5  with  Fig.  6,  we  make 
two  observations,  (i)  In  power-law  graphs,  typical  nongiant 
components  have  multiple  nodes.  In  Erdos-Renyi  random 
graphs,  however,  most  nongiant  components  are  isolated  nodes 
because  ( s )  is  always  less  than  1  (when  unoccupied  nodes 
are  treated  as  components  of  size  zero),  (ii)  The  analytical 
and  numerical  results  match  better  in  the  case  of  Erdos-Renyi 
random  graphs  than  in  the  case  of  power-law  graphs,  due  to 
the  inhomogeneity  of  power-law  degree  distribution. 

C.  Simulation-based  confirmation  of  the  size  distribution  ns  of 
nongiant  components 

To  illustrate  this,  we  conducted  simulations  on  the  same 
power-law  random  graphs  with  distribution  (23)  and  the  same 
Erdos-Renyi  random  graphs  G(n,k/n)  with  n  —  1000000. 
We  plot  the  log-linear  diagrams  in  Fig.  7  for  power-law  graphs 
and  in  Fig.  8  for  Erdos-Renyi  random  graphs. 

In  Fig.  7(a)  and  Fig.  7(b),  we  keep  the  respective  fractions 
of  giant  components  at  S  0. 1 1  and  S  ~  0.09  in  all  strategies 
L  =  0, 1,2,3  (this  can  be  achieved  by  choosing  different 
occupation  probabilities  q).  In  Fig.  8,  we  keep  the  fraction 
of  giant  components  at  S  0.07  in  all  strategies  L  —  0, 1,2,3 
(also  by  choosing  different  occupation  probabilities  q). 


By  comparing  Fig.  7  and  Fig.  8,  we  observe  the  following 
discrepancy  between  power-law  graphs  and  Erdos-Renyi 
graphs.  For  power-law  graphs  with  the  same  fraction  of 


Size  of  non-giant  component  s 

FIG.  8.  (Color  online)  Size  distribution  for  nongiant  components 
ns  in  Erdos-Renyi  random  graphs  with  k  =  15.  Simulation  results 
are  plotted  with  respect  to,  from  left  to  right,  L  =  0  with  strategy 
r0  =  1  (squares),  L  =  1  with  strategy  (roJi)  =  (0,1)  (circles),  L  = 
2  with  strategy  (r0, Ti , r2)  =  (0,0,1)  (triangles),  and  L  =  3  with 
strategy  (ro,ri,r2,r3)  =  (0,0,0, 1)  (crosses).  In  the  simulation  we  set 
the  occupation  probabilities  q  =  0.1,0.87,0.99,0.994  to  keep  the 
fraction  of  giant  components  S  ^  0.07  in  all  four  cases.  The  plots 
of  simulation  results  correspond  to  the  average  of  100  independent 
simulation  runs.  Exact  solutions  (solid  curves)  are  obtained  from 
Eq.  (22). 
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giant  components,  the  tails  of  the  size  distributions  of  the 
nongiant  components  in  the  case  of  L  >  0  are  heavier  than 
their  counterpart  in  the  case  of  L  =  0.  This  means  that,  for  a 
fixed  fraction  of  giant  components,  the  sizes  of  the  nongiant 
components  diverge  as  L  grows.  In  contrast,  for  Erdos-Renyi 
graphs  with  the  same  fraction  of  giant  components,  the  tails 
of  the  size  distributions  of  the  nongiant  components  in  the 
case  of  L  >  0  are  lighter  than  their  counterpart  in  the  case 
of  L  =  0.  This  means  that  for  a  fixed  fraction  of  giant 
components,  the  sizes  of  the  nongiant  components  diverge 
as  L  decreases.  Precisely  explaining  this  discrepancy  is  left 
for  future  work,  which  might  lead  to  the  characterization  of 
differences  between  the  two  popular  topologies  from  the  new 
perspective.  It  is  also  worthwhile  to  note  that  the  error  in  the 
case  of  L  =  1  is  more  significant  than  its  counterpart  in  the 
other  cases,  which  hints  that  L  —  1  is  a  very  critical  case  (as 
we  have  already  seen  in  Fig.  3).  Analytically  explaining  this 
phenomenon  is  another  interesting  future  work. 

IV.  CONCLUSION  AND  DISCUSSION 

We  introduced  the  notion  of  L-hop  percolation,  which  is  of 
conceptual  value,  because  it  captures  a  sort  of  nonindependent 
node  deletion,  and  of  practical  value,  because  it  leads  to 
insights  for  cybersecurity  (e.g.,  botnet  defense).  We  presented 
analytic  results  on  the  L-hop  percolation  threshold  as  well  as 
the  mean  size  and  size  distribution  of  nongiant  components 
of  complex  networks  under  such  operations.  In  particular,  we 
highlight  the  following  findings:  (a)  Being  able  to  trace  to 
one-hop  neighbors  is  “almost”  as  powerful  as  being  able  to 
trace  to  multihop  neighbors,  especially  in  power-law  networks, 
(b)  Most  nongiant  components  in  the  case  of  Erdos-Renyi 


random  graphs  are  isolated  vertices,  whereas  typical  nongiant 
components  in  the  case  of  power-law  graphs  contain  multiple 
edges.  Moreover,  as  the  number  of  chosen  nodes  increases,  the 
giant  component  disappears  abruptly  in  the  cases  of  L  >  0  but 
much  more  smoothly  in  the  case  of  L  =  0.  (c)  The  tail  of  the 
distribution  has  a  sharp  difference  between  L  —  0  and  L  >  0 
for  both  Erdos-Renyi  random  graphs  and  power-law  random 
graphs  when  the  giant  component  fraction  is  kept  as  a  constant. 

The  notion  of  L-hop  percolation  brings  a  range  of  in¬ 
teresting  problems  for  future  research.  In  addition  to  those 
mentioned  in  the  text,  here  are  more  examples.  What  con¬ 
clusions  we  can  draw  when  considering  other  types  of  node 
deletion  strategies?  If  we  define  some  cost  functions  for  node 
deletions  corresponding  to  L  =  0  and  L  >  0,  respectively, 
how  can  we  determine  the  optimal  L  value?  In  addition, 
one  reviewer  suggested  the  following  interesting  research 
problems:  What  is  the  degree  distribution  of  the  deleted  nodes? 
Does  it  strongly  differ  from  that  found  for  L  =  0?  What 
happens  on  networks  without  such  strong  anticorrelations? 
What  if  the  selection  operations  are  targeted  toward  certain 
nodes?  For  the  power-law  degree  distributions,  is  there  a  kind 
of  power-law  distribution  when  the  occupation  probability  q 
is  very  close  to  the  critical  point  q,  ? 
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Understanding  the  dynamics  of  computer  virus  (malware,  worm)  in  cyberspace  is  an  important  problem  that 
has  attracted  a  fair  amount  of  attention.  Early  investigations  for  this  purpose  adapted  biological  epidemic 
models,  and  thus  inherited  the  so-called  homogeneity  assumption  that  each  node  is  equally  connected  to 
others.  Later  studies  relaxed  this  often  unrealistic  homogeneity  assumption,  but  still  focused  on  certain 
power-law  networks.  Recently,  researchers  investigated  epidemic  models  in  arbitrary  networks  (i.e.,  no 
restrictions  on  network  topology).  However,  all  these  models  only  capture  push-based  infection,  namely  that 
an  infectious  node  always  actively  attempts  to  infect  its  neighboring  nodes.  Very  recently  the  concept  of  pull- 
based  infection  was  introduced  but  was  not  treated  rigorously  Along  this  line  of  research,  the  present  article 
investigates  push-  and  pull-based  epidemic  spreading  dynamics  in  arbitrary  networks,  using  a  nonlinear 
dynamical  systems  approach.  The  article  advances  the  state-of-the-art  as  follows:  (1)  It  presents  a  more 
general  and  powerful  sufficient  condition  (also  known  as  epidemic  threshold  in  the  literature )  under  which 
the  spreading  will  become  stable.  (2)  It  gives  both  upper  and  lower  bounds  on  the  global  mean  infection  rate, 
regardless  of  the  stability  of  the  spreading.  (3)  It  offers  insights  into,  among  other  things,  the  estimation  of 
the  global  mean  infection  rate  through  localized  monitoring  of  a  small  constant  number  of  nodes,  without 
knowing  the  values  of  the  parameters. 
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and  McKendrick  1927;  Bailey  1975;  Anderson  and  May  1991;  Hethcote  2000]  where 
each  individual  has  equal  contact  to  the  others.  Recently,  researchers  (mainly  statis¬ 
tical  physicists)  weakened  the  homogeneity  assumption  by  considering  spreading  in 
heterogeneous  networks,  but  mainly  focused  on  spreading  in  power-law  networks  (see 
Moreno  et  al.  [2002],  Pastor-Satorras  and  Vespignani  [2001,  2002],  Newman  [2003] 
and  Barrat  et  al.  [2008] ).  Very  recently,  computer  scientists  investigated  the  modeling 
of  spreading  dynamics  in  arbitrary  networks  (i.e.,  not  just  power-law  ones)  [Wang  et  al. 
2003;  Ganesh  et  al.  2005;  Chakrabarti  et  al.  2008].  These  models  considered  push- 
based.  spreading  (attacks  or  infection),  by  which  a  virus  always  actively  attempts  to 
infect  the  susceptible  computers.  Push-based  spreading  (and  thus  these  models)  can¬ 
not  accommodate  pull-based  attacks  such  as  “drive-by  download”  [Provos  et  al.  2007] 
that  a  susceptible  computer  can  get  infected  by  connecting  to  a  compromised  Web  site. 
This  inspired  researchers  to  introduce  the  more  realistic  push-  and  pull-based  models 
in  Li  et  al.  [2007],  which,  however,  conducted  mainly  a  simulation-based  study  and 
did  not  offer  deep  insights.  This  calls  for  a  systematic  and  analytic  study  of  push-  and 
pull-based  epidemic  spreading  dynamics  in  arbitrary  networks,  which  motivates  the 
present  article. 

Although  it  is  intuitive  that  push-  and  pull-based  spreading  dynamics  models  can 
accommodate  attacks  that  are  not  captured  by  push-based  models,  one  may  still  ask 
whether  this  is  actually  the  case.  This  is  a  legitimate  question  because  the  abstraction 
of  push-based  spreading  may  be  able  to  accommodate  pull-based  spreading  as  well  (e.g. , 
by  attempting  to  adjust  some  parameters  in  push-based  models).  However,  it  turns  out 
not  to  be  the  case  because  pull-based  spreading  and  push-based  spreading  are  two 
mechanisms  that  exhibit  fundamentally/conceptually  different  behaviors.  Specifically, 
pull-based  spreading  can  accommodate  the  outside  environment  of  a  network  in  ques¬ 
tion  (e.g.,  a  compromised  Web  site  does  not  belong  to  the  network  system  in  which  the 
push-based  epidemic  spreading  takes  place),  whereas  push-based  spreading  cannot 
accommodate  the  outside  environment.  It  therefore  appears  to  be  necessary  to  intro¬ 
duce  a  new  type  of  parameter,  called  a,  which  abstracts  the  probability  that  a  node 
gets  infected  because  of  its  own  reasons  such  as  accessing  a  malicious  Web  site.  As 
such,  push-based  spreading  dynamics  corresponds  to  the  case  of  a  —  0,  and  push-  and 
pull-based  spreading  dynamics  corresponds  to  the  case  of  a  >  0.  In  particular,  when 
compared  with  the  case  of  a  =  0,  in  the  case  of  a  >  0  the  spreading  would  never  die 
out  and  the  dynamics  is  much  more  difficult  to  analyze. 

1.1.  Our  Contributions 

First,  we  present  a  general  sufficient  condition  (also  known  as  epidemic  threshold  in 
the  literature)  under  which  the  push-  and  pull-based  epidemic  spreading  will  become 
stable.  This  result  supersedes  its  push-based  counterpart  given  by  Chakrabarti  et  al. 
[2008], 1  In  other  words,  our  sufficient  condition  is  more  general  and  powerful.  We  also 
give  a  more  succinct  variant  sufficient  condition  which,  among  other  things,  allows  to 
tune  (in  a  quantitative  rather  than  qualitative  fashion)  the  model  parameters  so  as 
to  assure  that  the  spreading  will  become  stable.  This  has  important  implications.  For 
example,  we  can  first  force  the  spreading  dynamics  to  become  stable  and  then  use  the 
global  mean  infection  rate  (i.e.,  the  rate  or  portion  of  infected  nodes)  as  an  index  to 
indicate  the  overall  security  of  a  network  and  to  support  higher-level  decision  making. 

Second,  we  notice  that  it  may  not  be  always  possible  to  force  the  spreading  to  become 
stable  (e.g,  the  cost  may  be  prohibitive).  When  the  spreading  is  not  stable  (which  does 
not  necessarily  mean  there  is  an  outbreak  in  the  setting  of  this  article  with  a  >  0),  it 


1  As  a  side-product,  we  point  out  an  error  in  Chakrabarti  et  al.  [2008] ,  by  which  their  sufficient  condition  was 
mistakenly  claimed  to  be  “necessary”  as  well.  The  details  will  be  given  in  Section  4.1. 
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would  be  important  to  know  what  are  the  worst-case  and  best-case  scenarios  in  terms  of 
the  global  mean  infection  rate.  Towards  this  end,  we  give  upper  and  lower  bounds  of  the 
global  mean  infection  rate.  The  bounds  are  actually  applicable  no  matter  the  spreading 
dynamics  is  stable  or  not  and  are  heuristically  tight  in  some  cases.  Unfortunately,  we 
are  currently  unable  to  precisely  pin  down  the  threshold  above  which  the  upper  bound 
would  be  tight,  and  below  which  the  lower  bound  would  be  tight.  Still,  we  note  that 
results  related  to  nonstable  scenarios  are  very  difficult  to  obtain  in  general. 

Third,  we  offer  the  following  deeper  insights  that  are,  to  our  knowledge,  the  first  of 
their  kind  when  considering  arbitrary  networks.  When  the  spreading  is  stable: 

— we  show  how  node  degree  governs  the  node  infection  rate; 

— we  give  a  condition  under  which  the  popular  mean  field,  approach  is  applicable  to 
study  push-  and  pull-based  dynamics  in  arbitrary  networks; 

— we  show  how  the  global  infection  rate  can  be  estimated  by  monitoring  a  small  number 
of  nodes  selected  in  a  certain  fashion  without  knowing  the  values  of  the  parameters 
and  independent  of  the  size  of  the  network. 

The  rest  of  the  article  is  organized  as  follows.  In  Section  2  we  discuss  the  related  prior 
works.  Because  our  work  is  built  on  top  of  the  concept/model  of  push-  and  pull-based 
spreading  dynamics  introduced  in  Li  et  al.  [2007],  we  briefly  review  it  in  Section  3. 
In  Section  4  we  present  some  sufficient  conditions  (i.e.,  epidemic  thresholds)  under 
which  the  spreading  will  become  stable,  and  present  some  upper  and  lower  bounds  on 
the  global  mean  infection  rate.  In  Section  5  we  present  some  deeper  insights  that  are 
especially  useful  in  the  absence  of  certain  system  information.  We  conclude  the  work 
in  Section  6  with  open  questions  for  future  research. 

2.  RELATED  WORK 

The  study  of  epidemic  spreading  dynamics  on  complex  networks  has  become  a  hot  topic. 
From  a  computer  science  perspective,  this  perhaps  has  been  catalyzed  by  the  ubiquitous 
presence  of  real-life  networks,  including  social  networks,  Internet,  and  email  networks. 
The  study  has  been  inherently  multidisciplinary,  and  researchers  have  investigated 
approaches  such  as  epidemiology,  dynamical  systems,  stochastic  processes,  percolation 
theory,  generating  functions,  algebraic  graph  theory,  and  statistical  physics.  As  such, 
a  thorough  and  complete  review  of  the  literature  is  clearly  beyond  the  scope  of  the 
present  article.  Instead,  we  here  focus  on  a  specific  perspective,  and  classify  the  most 
relevant  existing  studies  of  computer  virus/malware  dynamics  into  the  following  three 
generations.  (Because  our  focus  is  not  about  immunization,  we  do  not  aim  to  discuss 
the  vast  volume  of  literature  in  this  subfield  as  well.) 

The  first  generation,  initiated  by  Kephart  and  White  [1991,  1993]  and  followed  by 
numerous  variant  studies  such  as  Zou  et  al.  [2002,  2003],  focused  on  homogeneous  net¬ 
works.  Such  homogeneous  models  essentially  require  the  underlying  communications 
to  form  complete  graphs.  This  is  unrealistic  because  some  computers  (or  IP  addresses) 
are  blocked  from  accessing  certain  other  computers  (or  IP  addresses),  and  because 
malware  spreading  does  not  necessarily  exploit  the  fully  random  scanning  strategy 
[Chen  and  Ji  2005].  Moreover,  the  spreading  processes  that  exploit  social  networks 
(including  email  communication  networks)  are  certainly  not  complete  graphs,  but 
rather  follow  the  social  networks’  topologies. 

The  second  generation  focused  on  specific  heterogenous  (i.e.,  power-law)  networks, 
for  which  we  refer  to  the  most  recent  excellent  books  [Barrat  et  al.  2008;  Newman 
2010]  for  a  thorough  treatment  on  the  topic  and  for  the  large  body  of  literature.  From  a 
technical  perspective,  we  here  highlight  that  there  are  mainly  two  approaches  to  tack¬ 
ling  such  heterogeneity.  One  approach  is  to  exploit  the  specific  properties  of  the  degree 
distributions,  especially  the  Barabasi-Albert  power-law  degree  distribution  and  the 
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affiliated  preferential  attachment  model  [Barabasi  and  Albert  1999],  which,  however, 
are  not  without  limitations  (see,  for  example,  Faloutsos  et  al.  [1999]  and  Willinger  et  al. 
[2009]).  The  large  volume  of  research  results  in  this  widely  known  approach  were  nicely 
summarized  in  the  2008  book  of  Barrat  et  al.  [2008].  The  other  (perhaps  less  known) 
approach  is  to  exploit  generating  functions  [Wilf  1994].  This  approach  has  successfully 
made  a  connection  to  the  percolation  theory  [Callaway  et  al.  2000;  Newman  et  al.  2001; 
Newman  2007]  and  has  successfully  identified  epidemic  thresholds.  Moreover,  this 
approach  is  applicable  to  random  graphs  for  a  given  degree  sequence  (which  can  cor¬ 
respond  to  power-law  distributions)  as  introduced  by  Molloy  and  Reed  [1995,  1998].  A 
systematic  treatment  of  this  approach  can  be  found  in  the  2010  book  of  Newman  [2010]. 

The  third  generation  considers  arbitrary  heterogenous  networks.  The  earlier  in¬ 
vestigations  focused  on  push-based  spreading  [Wang  et  al.  2003;  Ganesh  et  al.  2005; 
Chakrabarti  et  al.  2008].  These  investigations  made  a  clear  connection  between  the 
fate  of  push-based  dynamical  processes  in  arbitrary  networks  and  the  algebraic  prop¬ 
erties  of  graphs  (more  specifically,  the  largest  eigenvalue  of  the  adjacency  matrix). 
The  present  work  enriches  the  dynamical  processes  by  considering  both  push-  and 
pull -based  epidemic  spreading,  a  concept  first  introduced  in  Li  et  al.  [2007],  which, 
however,  did  not  go  deep  enough  analytically.  To  be  more  precise,  the  two  papers  that 
are  the  predecessors  to  the  present  work  are  Chakrabarti  et  al.  [2008],  which  presents 
the  state-of-the-art  epidemic  threshold  in  push-based  models,  and  Li  et  al.  [2007], 
which  introduced  the  push-  and  pull-based  model  but  without  giving  deep  results.  We 
will  demonstrate  how  our  results  supersede  theirs. 

Finally,  we  note  that  the  epidemic  spreading  dynamics  model  considered  in  the 
present  article  is  essentially  the  so-called  Susceptible-Infection-Susceptible  (SIS) 
dynamics  in  nature  [Bailey  1975],  which  are  very  different  from  gossip  processes  (e.g., 
Demers  et  al.  [1987],  Karp  et  al.  [2000],  Kempe  et  al.  [2001],  Kempe  and  Kleinberg 
[2002],  Shah  [2009]),  which  are  essentially  Susceptible-Infection  (SI)  dynamics  in 
nature. 

3.  A  BRIEF  REVIEW  OF  THE  PUSH-  AND  PULL-BASED  MODEL 

We  consider  push-  and  pull-based  epidemic  spreading  in  complex  networks,  which  are 
naturally  represented  as  graphs.  Because  any  topology  could  be  possible  and  we  want 
to  draw  insights  that  are  widely  applicable,  we  do  not  make  any  assumption  on  the 
networks’  topologies,  according  to  which  the  epidemic  spreading  takes  place.  In  the 
rest  of  this  section  we  review  the  push-  and  pull-based  model  introduced  in  Li  et  al. 
[2007], 

Specifically,  we  consider  a  finite  network  graph  G  —  ( V ,  E ),  where  V  =  {1,2,...,  n\ 
is  the  set  of  nodes  or  vertices  and  E  is  the  set  of  edges  or  arcs.  Denote  by  A  =  [avu] 
the  adjacency  matrix  of  G,  where  avv  —  0  and,  if  and  only  if  (u,  v)  e  E  for  v  ^  u,  we 
have  avu  —  1.  The  graph  G  captures  the  topology  according  to  which  the  spreading 
takes  place,  where  u  can  directly  infect  v  only  when  (u,  v)  e  E.  For  undirected  graph, 
let  deg(r)  denote  the  degree  of  node  v  in  G.  The  discussion  in  the  article  focuses  on 
undirected  graphs,  but  can  be  adapted  to  the  setting  of  directed  graphs  by  replacing 
deg(  v)  as  the  in -degree  of  v  in  directed  graph  G. 

Consider  a  discrete-time  model  with  time  t  =  0, 1,  2 . At  any  time  t,  a  node  v  e  V 

is  either  susceptible  or  infectious.  (In  this  article,  “infectious”  and  “infected”  are  used 
interchangeably.)  Denote  by  sv(t)  the  probability  that  v  e  V  is  susceptible  at  time 
t,  and  iv(t)  the  probability  that  v  e  V  is  infectious  at  time  t.  The  model  invariant  is 
sv(t)  +  iv(t)  —  1  for  any  t  >  0.  Moreover,  a  susceptible  node  v  may  become  infectious 
at  a  single  time  step  because  of  push-  or  pull-based  infection,  and  an  infectious  node 
may  become  susceptible  because  of  defense  or  cure.  This  explains  why  our  model 
is  SIS  (Susceptible-Infection-Susceptible)  in  nature,  except  that  it  considers  both 
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Fig.  1 .  The  state  transition  diagram  for  node  v  e  V . 

push-  and  pull-based  infections.  To  accommodate  push-  and  pull-based  spreading,  we 
use  the  following  parameters. 

— a:  This  is  the  pull-based  infection  capability,  namely,  the  probability  a  susceptible 
node  becomes  infectious  at  a  discrete  time  step  because  of  its  own  activity  (e.g., 
connecting  to  a  malicious  Web  site  which  may  not  belong  to  G ). 

— y:  It  is  the  push-based  infection  capability,  namely,  the  probability  that  an  infectious 
node  u  successfully  infects  a  susceptible  node  v,  where  (u,  v)  e  E.  In  the  case  of 
undirected  graph,  it  is  natural  that  yuv  =  yvu  for  all  (u,  v)  e  E. 

— (i:  This  is  the  cure  capability,  namely,  the  probability  that  an  infectious  node  becomes 
susceptible  at  a  single  time  step. 

The  main  notations  used  throughout  the  work  are  summarized  next. 


•  •  •  >  ^n, A 

the  eigenvalues  of  the  adjacency  matrix  A  of  network  graph  G  with 
li  a  >  •  ■  ■  >  kn ,a  (in  modulus) 

^■max(W 

the  largest  eigenvalue  (in  modulus)  of  matrix  M 

MT 

the  transpose  of  matrix  (or  vector)  M 

Sv{t) 

the  probability  that  node  v  is  susceptible  at  time  t 

iv(t ) 

the  probability  that  node  v  is  infectious  at  time  t 

i(t ) 

the  global  mean  infection  rate  (or  probability)  J2veV  *«(0; 

i  —  lim^oo  i(t)  is  also  used  in  the  case  the  spreading  is  stable 

i[t0,  h] 

the  average  of  i(t)  over  time  interval  t  e  ho,  h ] 

( r ) 

the  mean  of  random  variable  r 

a 

the  probability  that  a  node  becomes  infectious  at  a  single  time  step 
because  of  pull-based  infection 

p 

the  probability  that  an  infectious  node  becomes  susceptible  at  a 
single  time  step 

Y 

the  probability  that  a  susceptible  node  v  is  infected  by  an  infectious 
neighbor  u  where  ( u ,  v)  e  E  at  a  single  time  step 

Sv(t ) 

the  probability  that  a  susceptible  node  v  becomes  infectious  at  time 
t  +  1  because  of  its  infectious  neighbors  {u  :  ( u ,  v)  e  E }  at  time  t 

\n 

the  n  x  n  identity  matrix  (i.e.,  ajy  =  1  if  j  =  j',  and  a}y  =  0 
otherwise) 

Figure  1  depicts  the  state  transition  diagram  [Li  et  al.  2007],  according  to  which 
node  v  e  V  changes  its  state.  As  in  Chakrabarti  et  al.  [2008],  we  may  assume  that 
the  (neighboring)  nodes’  states  are  independent.  Then,  at  time  t  +  1,  v  may  become 
infectious  because  of  pull-based  infection  with  probability  a,  or  because  of  push-based 
infection  with  probability  Sv(t),  where 

Sv(t)  =  1-  f]  [1  -yiu(t)\.  (3.1) 

(u,v)eE 
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As  a  result,  the  master  equation  of  the  nonlinear  dynamical  system  is  [Li  et  al.  2007] 

sv(t  +  1)  =  [(1  -  a)(l  -  <5„(£))]s„(£)  +  pivit) 

iv(t  +1)  =  [1  -  (1  -  a)(l  -  <5„(t))]s„(£)  +  (1  -  P)iv(t), 

namely 


iv(t  +  1)  — 


1  -  a)  Yl  (1  _  Yiu.it) 


( u,v)eE 


(1  -  iv(t))  +  (1  -  p )ivit). 


(3.2) 


Note  that  the  preceding  master  equation  preserves  the  invariant  sv(t)  +iv(t)  —  1. 


4.  EPIDEMIC  THRESHOLDS  AND  BOUNDS  ON  INFECTION  RATE 

In  this  section  we  address  the  following  questions  in  the  aforesaid  push-  and  pull-based 

model. 

— What  are  the  sufficient  conditions  (i.e.,  epidemic  thresholds  )  under  which  the  spread¬ 
ing  will  become  stable  without  resorting  to  any  approximation  (which  is  often 
used  for  analyzing  nonlinear  dynamical  systems)?  We  address  this  question  in 
Sections  4. 1-4.2. 

— Can  we  bound  the  node  infection  rate  without  requiring  the  spreading  to  become 
stable  (Section  4.3)?  Such  results  would  be  useful  especially  when  it  is  impossible  or 
prohibitive  to  “manipulate”  or  force  the  spreading  dynamics  to  become  stable. 

In  addition,  we  will  discuss  the  applications  of  these  theoretic  results  (Section  4.4). 


4.1.  A  General  Epidemic  Threshold 

Unlike  the  push-based  model,  where  a  —  0  and  thus  lim/  ^^  ijt)  —  0  for  all  v  is 
a  trivial  equilibrium  state,  we  do  not  have  such  a  leverage  for  the  push-  and  pull- 
based  model  because  a  >  0  and  thus  it  is  almost  always  true  that  iv(t)  >  0,  meaning 
that  the  spreading  would  not  die  out.  Although  we  are  unable  to  point  out  where  the 
equilibrium  state  is,  we  show  that  there  must  be  an  equilibrium  state  (i.e.,  existence) 
that  is  exponentially  globally  stable. 

Theorem  4.1  (A  general  sufficient  condition  under  which  the  spreading  will  be¬ 
come  stable).  Let  Uj . . . . ,  )*]  be  an  equilibrium  of  nonlinear  dynamical  system  (3.2), 
H  =  diag[hv]n=1  with 


li 


V 


-P  +  d-«)  Y\  (i-K) 

(u,v)eE 


and  the  other  parameters  be  specified  as  before.  If 

hmaxiH  +  yil  -  a)/\)  <  1,  (4.1) 

then  system  (3.2)  is  globally  exponentially  asymptotically  stable,  namely  that 
lim/^oo  iv(t)  —  i*  holds  for  v  —  1 . n  regardless  of  the  number  of  the  initially  in¬ 

fected  nodes. 

Proof.  Let  lit )  =  [ii(t),  ■  ■  • ,  in(t)\T .  Then  Eq.  (3.2)  indicates  lit  +  1)  =  filit))  for  some 
continuous  function  f  :  [0, 1]"  -»  [0,  l]'1.  Since  [0, 1]"  is  convex,  Brouwer’s  fixed  point 
theorem  [Yoshida  1971]  says  that  there  exists  at  least  one  equilibrium  i*  —  \i  f  . . . ,  i*]T , 
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that  is, 

i*v=  l-(l-a)  fl  (t-K)  a-ip  +  a-^VreV. 

( u,v)eE 

Let  zv(t)  =  iv(t)  —  i*  for  all  v  =  1, . . . ,  n.  Therefore,  we  have 
zv(t+  1)  =  -fzv(t)  +  (1  -  a)  ]~[  (1  -  yi*)zv(t) 

( u,v)eE 

+(i-o)  na-m-  n  (1  -  yiu(t))  (1  -  iv(t)). 

_(u,v)eE  ( u,v)eE 

Note  that  for  each  v  =  1 . n, 

n  d-yty-  n  Q-Yhti)) 

( u,v)eE  ( u,v)eE 

=  yJ2  zu'(t)  •  ]"~[  (1  -yi*)-  ]~[  (1  -yikit)), 

(u',v)eE  ( u,v)eE ,  u<u'  (k,v)eE,  k>uf 

we  have 

\zv(t  +  1)|  <  hv\zv{t)\ 

+  (1  —  a)y  22  \zu(t)\  ]""[  (1  -  yi*)  ]~[  (1  -  yikit)) 

( u,v)eE  ( u,v)eE ,  u<u'  (k,v)eE,  k>u! 

<  hv\zv(t)\  +(1  -a)y  ^2  \zu(t)\. 

(u,v)eE 

Define  the  following  comparison  system 

9v(t  +  1)  =  hvOv(t)  +  (1  -  a)y  22  (4-2^ 

( u,v)eE 

with  initial  values  6(0)v  —  |z„(0)|  >  0  for  v  =  1, . . . ,  n.  This  implies  that  \zv(t)\  <  9v(t) 
holds  for  all  v  =  1, . . . ,  n  and  t  >  0.  Let  ®(t)  —  [6\{t), ,  0ra(t)]T.  Then  Eq.  (4.2)  has  the 
following  matrix  form 

©(£  +  1)  =  [H  +  y(l  -  a)A]©(£),  (4.3) 

where  A  is  the  adjacency  matrix  of  the  network  graph.  Since  the  graph  is  connected, 
the  matrix  X  =  H  +  y(  1  —  a) A  is  irreducible  and  nonnegative.  The  Perron-Frobenius 
theorem  [Horn  and  Johnson  1985]  says  that  the  spectral  radius  of  X  is  its  largest 
eigenvalue  in  modulus.  Therefore,  condition  (4.1)  implies  that  all  eigenvalues  of  the 
matrix  X  are  less  than  1  in  modulus.  This  implies  that  system  (4.2)  converges  to  zero 
exponentially.  Because  |z„(f)|  <  9v(t),  lim^oo  |z„(t)|  =  0  holds  for  all  v  —  1, . . . ,  n,  that 
is,  lim^oo  ijt)  =  i*  holds  for  all  v  =  1, . . . ,  n  and  the  convergence  is  at  an  exponential 
pace.  This  completes  the  proof.  □ 

The  previous  theorem  says  that  when  the  sufficient  condition  is  satisfied,  the  spread¬ 
ing  will  converge  to  the  equilibrium  state  (but  not  dying  out)  at  an  exponential  pace, 
regardless  of  the  number  of  initially  infected  nodes. 

Comparing  the  sufficient  condition  presented  in  Chakrabarti  et  al.  [2008]  to  ours.  Our 
result  supersedes  the  sufficient  condition  for  the  case  of  a  —  0  presented  in  Chakrabarti 
et  al.  [2008]  because  of  the  following  corollary  of  Theorem  4.1. 
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Corollary  4.2.  In  the  case  a  =  0,  if 

A.i.a  <  -•  (4.4) 

X 

the  spreading  dies  out  regardless  of  the  number  of  initially  infected  nodes,  that  is, 
lim^oo  iv{t )  =  0  for  all  v  —  1, . . . ,  n. 

Proof.  Note  that  a  —  0  and  i*  =  0  implies  H  =  (1  —  (i)In  in  Theorem  4.1.  So,  condition 
(4.1)  becomes  A.max [( 1  —  f>)In  +  yA]  <  1,  which  is  equivalent  to  1  -  f  +  yXmax(A)  <  1, 
namely  condition  (4.4).  Hence,  the  corollary  holds.  □ 

While  Corollary  4.2  corresponds  to  the  sufficient  condition  given  in  Chakrabarti  et  al. 
[2008],  meaning  that  our  sufficient  condition  is  strictly  more  powerful,  the  sufficient 
condition  for  the  case  of  a  =  0  is  not  compatible  with  the  more  succinct  sufficient 
condition  for  the  case  of  a  >  0  (presented  in  Section  4.2  ).  This  also  illustrates  why  the 
treatment  in  the  case  of  a  >  0  is  different. 

Caveat:  The  conditions  are  sufficient  but  not  necessary.  Chakrabarti  et  al.  [2008] 
claimed  through  their  Theorem  2  that  r  =  ^  |  (in  their  terminology),  which 

is  equivalent  to  the  preceding  condition  (4.4),  is  also  the  necessary  condition  for  the 
spreading  to  die  out  for  the  case  of  a  —  0.  Now  we  point  out  that  their  proof  is  flawed 
and  explain  why.  We  use  their  terminology  so  that  it  is  easier  to  check  against  their 
proof,  which  is  presented  in  Appendix  A  of  Chakrabarti  et  al.  [2008].  Their  proof 
strategy  is  to  prove  “if  the  spreading  dies  out,  then  r  >  As  shown  in  the  derivation 
of  their  Eq.  (A6),  this  requires  to  prove  “if  the  system  is  asymptotically  stable,  then 
Xi'S  =  1  —  S  +  ftXi'A  <  1  for  Vi,  where  S  =  Vg(0)  =  fA  +  (1  —  5)1.”  This  subsequently 
requires  to  prove  “if  the  system  is  asymptotically  stable  at  P  —  0,  then  the  eigenvalues 
of  Vg(0)  are  less  than  1  in  absolute  value.”  The  proof  of  this  claim  was  attributed  to 
their  Lemma  1,  which  however — as  we  now  point  out — cannot  get  through  because  the 
lemma  actually  states  the  opposite,  namely  “if  the  eigenvalues  of  Vg(0)  are  less  than 
1  in  absolute  value,  then  the  system  is  asymptotically  stable  at  P  —  0.”  Therefore,  the 
necessity  proof  in  Chakrabarti  et  al.  [2008]  is  flawed. 

Having  pointed  out  the  flaw,  one  would  wonder  whether  it  is  possible  to  prove  “if 
the  system  is  asymptotically  stable  at  P  =  0,  then  the  eigenvalues  of  Vg(0)  are  less 
than  1  in  absolute  value”  or  “r  >  f  in  Chakrabarti  et  al.  [2008]  is  indeed  the  necessary 
condition  for  the  spreading  to  die  out.”  It  turns  out  not  to  be  the  case.  Conceptually,  for 
a  nonlinear  dynamical  system,  when  the  largest  eigenvalue  of  the  Jacobin  matrix  at 
equilibrium  equals  to  one  exactly,  it  may  not  imply  that  the  system  is  not  asymptotically 
stable.  For  example,  consider  a  simple  system 

x(t  +  1)  =  x(t)  —  y(x(t))2  (4.5) 

with  0  <  y  <  1/2  and  x(0)  e  [0, 1].  Since  x(t)  monotonically  decreases  and  belongs  to 
the  interval  [0, 1]  for  all  t,  it  must  converge  to  a  fixed  point  of  the  system.  Since  zero  is 
the  only  fixed  point  of  system  (4.5),  we  conclude  that  lim,_  >0O  x(t)  =  0.  Even  though  the 
modulus  of  the  derivative  of  the  right -hand  side  at  zero  is  1,  the  system  converges  to 
zero  if  the  initial  value  is  less  than  1.  In  other  words,  their  condition  (as  well  as  ours  ) 
is  sufficient  but  not  necessary. 

Having  pointed  out  that  their  proof  strategy  for  necessity  cannot  get  through,  one 
may  wonder  whether  the  necessity  can  be  proven  with  a  different  strategy.  It  turns  out 
not  to  be  the  case,  as  we  show  here  with  a  concrete  counter-example  that  a  sufficient 
condition  is  not  necessary  in  general.  Our  counter-example  that  follows  shows  that  (in 
our  terminology)  there  exist  graphs  where  the  infection  rate  can  still  go  to  zero  when 
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A.i_a  =  r,,  which  immediately  violates  the  necessity  of  the  sufficient  condition.  Let  us 
consider  a  graph  with  only  two  nodes  linked  with  a  single  edge.  Then,  we  have  the 
following  dynamical  system  to  describe  the  infection  rate. 

hit  +  1)  =  yhit) [1  -  ii(t)]  +  (1  -  P)h(t) 
hit  +  1)  =  yhit)[l  -  hit)]  +  (1  -  /3)hit) 


In  this  case,  the  adjacency  matrix  is 


0  1 
1  0 


,  whose  largest  eigenvalue  in  modulus  is 


A-i ,a  =  1.  So,  under  the  condition  Ai,a  =  we  have  ft  —  y.  Suppose  0  <  y  <  1/2.  Then, 

0  <  hit)  <  1  implies  1  —  y  —  yhit)  >  0.  Let  c  e  [0, 1]  such  that  hit)  <  c  and  hit)  <  c  at 
time  t,  we  have  y  —  cy  >  0  and  thus 


hit  +  1)  =  [1  —  y  —  yhit)\hit)  +  yhit)  <  [1  —  y  —  yh(t)]c  +  yhit) 

=  hit)iy  -  yc)  +  ci  1  -  y)  <  ciy  -  yc)  +  c(l  -  y)  =  c  -  c2y. 

In  a  similar  fashion  we  have  hit  +  1)  <  c  —  c2y.  Still  considering  the  comparison 
system  (4.5)  with  x(0)  =  1,  we  observe  that  iiit)  <  xit)  and  hit)  <  xit)  hold.  From  the 
preceding  reasoning,  we  have  lim^ooxft)  =  0.  This  implies  that  lim,  _><*)  hit)  —  0  and 
lim^oo  hit)  —  0,  which  serves  as  a  desired  counter-example. 


4.2.  A  More  Succinct  Epidemic  Threshold 

The  previous  sufficient  condition  is  general  but  not  succinct.  This  motivates  us  to 
present  the  following  less  general,  but  more  succinct,  sufficient  condition. 

Theorem  4.3  (A  more  succinct  sufficient  condition  under  which  the  spreading  will 
become  stable).  Let  m  —  maxdeg(r).  In  the  case  fi  <  (1  —  a)(l  +  (1  —  y)m)/2,  if 

veV 


kl.A  < 


a  +  f> 
y(l  -  a)' 


then  the  spreading  will  become  stable  regardless  of  the  initial  number  of  infected  nodes. 
In  the  case  f  >  (1  —  a)(l  +  (1  -  y)m)/ 2,  if 


< 


+  (1  —  a)(l  —  y)n 


yi  1  -  a) 

then  the  spreading  will  become  stable  regardless  of  the  initial  number  of  infected  nodes. 


Proof.  Let  xf  =  max(|l  —  a~p\,  \  —f  +  (1  —  a)(l  —  y)m  |),  and  the  other  parameters  be 
specified  as  earlier.  Because 


(1  -  «X1  -  y)m  <  (1  -  «)  n(1- Yi*u)  <  (1  -  a), 

( u,v)eE 


we  have 


hv  <  maxj 1 1  -  a  -  fi\,  \  -  fi  +  (1  -  a)(l  -  y)m |}. 

So,  the  largest  eigenvalue  of  H  +  il  —  a)yA  is  smaller  than  that  of  if  I, l  +  (l  —  a)y  A.  Note 
that  if 

1  —  \lf 

xjr  +  yil  -  aUpA  <  1,  or  A.i,a  <  — T, - t,  (4.7) 

yi  1  -  a) 

then  /.max III  +  (1  —  «)yA|  <  1.  This  means,  by  applying  Theorem  4.1,  that  the  spreading 
will  become  stable  regardless  of  the  number  of  initially  infected  nodes. 
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Observe  that  in  the  case  P  <  (l-a)(l  +  (l  -  y)m)/ 2,  sufficient  condition  (4.7)  becomes 

,  «  +  0 

H.A  <  — TZ - 7- 

/(I  -  a) 

In  the  case  P  >  (1  —  a)(l  +  (1  —  y)m)/ 2,  sufficient  condition  (4.7)  becomes 

1  —  +  (1  —  a)(l  —  y)m 


H,A  < 


y(l  —  a) 


This  completes  the  proof.  □ 

Comparing  the  sufficient  condition  given  in  Li  et  al.  [2007]  to  ours.  Li  et  al.  [2007] 
also  presented  a  sufficient  condition  for  the  case  a  >  0,  namely 

a  +  [  a  +  P-2 

Ha  <  - and  A„.a  >  - , 


y 


y 


where  A„,a  is  the  smallest  (in  modulus)  eigenvalue  of  A.  However,  in  the  derivation  of 
their  sufficient  condition  they  used  the  following  approximation  of  Eq.  (3.2) 

iv(t  +  1)  a  +  (1  -  a  -  P)iv{t)  +  y  ^  iu(t), 

( u,v)eE 


which  omits  all  the  nonlinear  terms  and  is  therefore  quite  coarse. 

In  contrast,  our  sufficient  condition  given  in  Theorem  4.3  is  derived  without  using 
any  approximation.  As  a  result,  our  sufficient  condition  is  advantageous  because  it  is 
both  more  concise  and  weaker.  Our  sufficient  condition  is  more  concise  because  it  says 
that  A.„,a  is  irrelevant,  whereas  their  sufficient  condition  unnecessarily  involves  /,„A- 
Our  sufficient  condition  is  weaker  because  when  p  <  (1  —  a)(l  +  (1  —  y)m)/ 2,  Ha  < 
implies  H.a  <  f'-i  Pu)  but  the  latter  does  not  imply  the  former. 


4.3.  Bounding  the  Node  Infection  Rate 

The  aforesaid  theorems  state  sufficient  conditions  under  which  the  spreading  will 
become  stable.  Because  we  are  unable  to  compute  the  equilibrium  state  for  a  >  0,  the 
next  natural  question  is:  Can  we  bound  the  nodes’  infection  rates?  Now  we  address  this 
question. 

_ Theorem  4.4  (Bounds  of  iv(t)  regardless  of  the  stability  of  the  spreading).  Let 

lim^ooiAC  denote  the  upper  bound  of  the  limit  of  iv{t),  and  lim,  >ociv(t)  denote  the 
lower  bound  of  the  limit  ofiv{t).  Then,  we  have 


where 


limf_).ooi„(0  <  Of  and  limf  _^i„(t)  >  0V  , 
1  -  (1  -or)(l  -  y)deg(u) 


"  min{l  +  P  —  (1  —  a)(l  —  y)desW,  1}’ 

f  l-(l-a)(l-rl,)^>  (1  _  )(1  _  vy)degU0  >  a 

0-  =  J  l+y8-(l-ff)(l-)/v)des('’>  1  7V>  —P 

[  ((1  -  a)(l  -  yv)de^v)  -  p)0f  +  1  -  (1  -  a)(l  -  yr)deg(,d  otherwise 

with  v  —  min{l  —  p ,  a}. 

Proof.  First,  consider  the  upper  bound.  Because 

iv(t  +  1)  <  [1  -  (1  -  a)(l  -  y)deg(l,)](l  -  iv(t ))  +  (1  -  P)iv(t), 
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we  have 


iv(t  +  1)  -  0+  <  [(1  -  «)(1  -  y)deg(u)  -  Mv(t)  -  0+). 

In  the  case  (1  —  a)(l  —  y)deg(l,)  >  fi  >  0,  there  are  two  scenarios.  If  there  exists  some  to 
such  that  iv(to)  <  0+ ,  then  we  have  iv(t  +  1)  <  9+  for  all  t  >  to-  If  iv(t)  >  f°r  t>  then 

\iv(t+  1)  -  0+|  <  ltd  -  «)(1  -  y)degW  -  mUt)  -9+\. 

Because  |(1  —  a)(l  —  y)deg(")  —  <  1,  we  have 

lim (iv(t)  -9+)  =  0. 

t — >  OO 

So,  we  have  lim^ooi„(t)  <  in  both  scenarios. 

In  the  case  (1  —  a)(l  -  y)deg(,;)  <  /6,  we  immediately  have 

iv(t  +  1)  <  [tl  -  «)(1  -  y)deg(u)  -  p]iv(t)  +  1  -  (1  -  «X1  -  y)deg(i;) 

<  1 -tl -aXl  -  y)deg(u). 

Therefore,  in  both  cases,  we  have  lim/^^f  ,  c * )  <  e- 

Second,  consider  the  lower  bound.  Since  1  —  (1  —  a)(l  —  Sv(t))  >  a,  we  can  conclude 
that  iv(t  +  1)  >  a(l  —  iv(t))  +  (1  -  P)iv{t).  The  fact  iv(t)  e  [0, 1]  implies  that  the  previous 
right-hand  side  is  greater  than  v.  So,  we  have  iv(t)  >  v.  Thus,  we  have 

iv(t  +  1)  >  [1  -  (1  -  a XI  -  Kv)degW](  1  -  iv(t))  +  (1  -  PMtl 

In  the  case  1 1  -  crXl  -yv)ie^v)  >  d,  we  can  rewrite  the  preceding  inequality  as 

iv(t  +  1)  -  9~  >  [(1  -  a)(l  -  yu)deg(l,)  -  -  9~). 

We  observe  that  there  are  two  scenarios.  If  there  exists  some  t\  such  that  iv(t{)  >  9~, 
then  iv(t)  >  9~  holds  for  all  t  >  ti.  If  iv{t)  <  9~  holds  for  all  t  >  0,  then 

I iv(t  +  1)  -  9~ |  <  |  [( 1  —  a)(  1  -  yr)deg<"’  -  £]||i„(*)  -  0“|. 

Because  |[(1  -  a)(l  -  yy)deg(l,)  —  dll  <  1,  we  have  lim^oo  \iv(t  +  1)  —  9~\  =  0.  Therefore, 
we  have  lim,  >  9~  in  both  scenarios. 

In  the  case  (1-aXl-  y)AesM  <  P,  we  immediately  have 

lim,^i„(t  +  1) 

>  lim^^{[(l  -  «X1  -  yv)degll,)  -  p]iv(t))  +  1  -  (1  -  «X1  -  Kv)deg(l’’} 

>  [(1  -  «X1  -  yy)deg(u)  -  +  1  -  (1  -  «X1  -  yv)deg(u) 

>  [(1  -  aXl  -  yiOdeg(u)  -  PK  +  1  -  (1  -  a)(l  -  yv)deg(l,). 

Therefore,  in  both  cases,  we  have  limf  .  This  completes  the  proof.  □ 

The  inequalities  we  use  to  prove  Theorem  4.4  indicate  that  when  the  actual  infection 
rate  is  significantly  smaller  than  1,  the  lower  bound  would  be  tighter;  when  the  actual 
infection  rate  is  close  to  1,  the  upper  bound  would  be  tighter.  This  heuristics  is  confirmed 
by  our  simulation  results.  However,  it  is  challenging  to  precisely  pin  down  the  infection 
rate  threshold,  above  which  the  upper  bound  is  tight  and  below  which  the  lower  bound 
is  tight.  We  leave  this  to  future  research,  while  pointing  out  that  a  key  difficulty  comes 
from  the  fact  hat  the  spreading  may  not  be  stable. 
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4.4.  Applications  of  the  Thresholds  and  Bounds 

In  addition  to  their  theoretic  significance,  the  thresholds  and  bounds  have  good  appli¬ 
cations,  especially  for  quantitatively  guiding  the  operation  of  tuning  the  parameters  so 
as  to  achieve  the  desired  outcomes. 

First,  the  thresholds  can  be  used  to  guide  the  adjustment  of  parameters  so  as  to 
make  the  spreading  stable.  For  this  purpose,  it  is  intuitive  to  decrease  y  (e.g.,  by 
imposing  a  more  thorough  examination  on  message  packets)  and/or  a  (e.g.,  by  blocking 
access  to  potentially  malicious  Web  sites),  and/or  increase  p  (e.g.,  by  deploying  more 
powerful,  and  thus  likely  more  expensive,  defense  tools).  It  would  be  less  intuitive  that 
the  defender  can  also  seek  to  decrease  M.a  by  deleting  edges  and/or  nodes.  It  would 
be  far  less  intuitive  that  the  optimal  strategy  is  to  delete  edges/nodes  so  as  to  reduce 
Xi,a  as  much  as  possible,  rather  than  to  delete  (for  example)  the  largest-degree  nodes 
as  noted  in  Chakrabarti  et  al.  [2008].  In  any  case,  how  much  adjustment  is  enough 
to  assure  that  the  spreading  will  become  stable?  The  thresholds  (more  specifically,  the 
inequalities)  offered  in  Theorems  4.1  and  4.3  can  be  used  to  answer  such  questions 
quantitatively.  Moreover,  if  the  costs  incurred  due  to  the  adjustment  of  parameters  are 
known,  then  Theorems  4.1  and  4.3  also  provide  a  basis  for  cost-effectively  adjusting 
the  parameters. 

Second,  consider  the  case  of  a  =  0.  Suppose  it  is  impossible  or  over-costly  to  make 
the  spreading  die  out  (e.g.,  because  the  parameters  a,  ft,  y,  and  M,a  cannot  be  made 
arbitrarily  small/large,  while  assuring  that  the  network  remains  functioning).  This 
means  that  the  conditions  given  in  Theorems  4.1  and  4.3  are  not  satisfied  any  more. 
Even  if  the  spreading  might  not  be  stable,  we  can  still  utilize  the  bounds  given  in 
Theorem  4.4  to  achieve  some  useful  goals.  On  one  hand,  the  upper  bound  can  be  used 
to  estimate  the  worst-case  infection  rate  and,  perhaps  more  importantly,  can  be  used  to 
guide  the  tuning  of  parameters  in  the  bound  6>+  so  as  to  reduce  it  as  much  as  possible. 
Although  it  is  not  necessarily  a  tight  estimation  of  the  infection  rate,  it  still  would 
be  useful  because  we  now  can  control  the  worst-case  infection  rate  under  a  desired 
level.  On  the  other  hand,  the  lower  bound  can  be  used  to  estimate  in  a  sense  the  best- 
case  scenario.  Putting  the  bounds  together,  we  further  get  a  clue  on  how  tight/loose  the 
bounds  are  and,  when  the  two  bounds  are  close  to  each  other,  what  is  the  small  interval 
in  which  the  infection  rate  resides.  Finally,  assuming  the  relevant  cost  functions  are 
known,  the  tuning  of  parameters  can  also  be  made  cost  effective.  Such  quantitative 
guidance  would  be  always  preferred  to  the  qualitative  intuitions. 

Third,  consider  the  case  of  a  >  0,  which  means  that  the  spreading  would  never  die 
out.  In  this  case,  even  if  we  cannot  make  the  spreading  stable  (similarly  because  it 
is  either  impossible  or  over-costly  in  practice  as  in  the  case  of  a  =  0),  the  previous 
discussion  on  utilizing  Theorem  4.4  to  guide  the  adjustment  of  parameters  would  be 
still  applicable  here.  Moreover,  as  shown  in  Section  5.1,  the  bounds  hinted  us  to  conduct 
some  deeper  analysis  that  leads  to  useful  insights. 

4.5.  Confirming  the  Thresholds  and  Examining  the  Bounds  via  Simulation 

We  conducted  a  simulation  study  to  confirm  the  analytical  results  by  using  two  real-life 
complex  network  datasets  obtained  from  http://snap.stanford.edu/data/. 

— Epinions  online  social  network  dataset.  This  is  a  directed  graph  corresponding  to 
a  real-life  online  social  network,  where  nodes  represent  people  and  arcs  represent 
interactions  between  them.  The  graph  has  75,879  nodes  and  508,837  arcs  with  aver¬ 
age  node  in-  and  out -degree  6.7059,  maximal  node  in-degree  3,035,  maximum  node 
out-degree  1,801,  and  M.a  =  106.53. 

— Enron  email  dataset.  This  is  a  graph  representing  Enron’s  internal  email  communi¬ 
cations,  where  nodes  represent  employees  and  arcs  represent  email  messages.  Unlike 
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Fig.  2.  The  case  of  Epinions  dataset  (/S  =  0.1  and  y  =  0.004). 


the  Epinions  dataset,  ( u ,  v)  e  E  always  implies  (v,u)  e  E  in  Enron  dataset  (that  is, 
the  interactions  between  a  pair  of  nodes,  if  any,  are  always  mutual),  which  allows  us 
to  treat  it  as  an  undirected  graph.  The  graph  has  36,692  nodes  and  367,662  edges 
with  average  degree  20.0404,  maximal  node  degree  2,766,  and  Ai.a  =  118.4177. 

Although  not  reported  here,  we  note  that  both  networks  exhibit  power-law  degree 
distributions.  The  reason  of  using  these  datasets  is  that  push-  and  pull-based  epidemic 
spreading  could  take  place  in  such  networks. 

Using  the  preceding  datasets,  we  compare  the  simulation  results  (averaged  over  500 
independent  runs)  of  the  global  mean  infection  rate  i(t)  =  ^  J^veV  iv(t);  the  average 

(over  all  nodes)  of  the  results  obtained  from  Eq.  (3.2);  the  average  (over  all  nodes)  of  the 
bounds  given  in  Theorem  4.4.  The  parameters  are  selected  according  to  Theorem  4.3  so 
that  the  spreading  will  become  stable.  In  all  cases,  we  let  20%  randomly  picked  nodes 
be  initially  infected.  For  the  sake  of  better  visual  effect,  we  use  higher  resolutions  for 
plotting  figures  whenever  possible. 

Using  the  Epinions  network  dataset,  Figure  2  plots  the  comparison  of  two  parameter 
scenarios;  in  both  cases  the  spreading  will  become  stable.  First,  we  note  that  the 
simulation  result  still  oscillates  a  little  bit  in  Figure  2(b)  because  the  y-axis  resolution 
is  very  high  ( 10  4).  Indeed,  we  calculated  the  standard  deviations  of  the  simulation 
results  in  both  scenarios,  and  found  that  the  standard  deviations  are  not  visually 
noticeable  (which  explains  why  we  did  not  plot  the  standard  deviations).  We  interpret 
(at  least  partially)  this  phenomenon  as  follows.  Since  the  initial  conditions  are  random, 
the  asymptotic  limit  of  the  infection  rate,  i*,  supposing  it  exists,  is  a  random  variable. 
Hence,  the  variance  of  the  mean  infection  rate  i  =  ^  Y,v=i  K 


If  var(i*)  is  bounded  from  above  by  some  positive  constant,  then  var(i)  converges  to 
zero  as  n  —*■  oo,  namely  as  the  size  of  the  network  goes  to  infinity.  Hence,  the  standard 
deviation,  y/ var{i),  converges  to  zero.  This  would  at  least  partially  interpret  why  the 
observed  standard  deviations  are  very  small.  Second,  even  if  in  Figure  2(a)  with  a  =  0.1, 
we  observe  that  the  lower  bounds  are  relatively  tight.  This  in  a  sense  confirms  the 
aforementioned  heuristics  that  when  the  infection  rate  is  far  from  1,  the  lower  bound 
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Fig.  3.  The  case  of  Enron  email  dataset  (fl  =  0.4  and  y  =  0.001). 


200 


would  be  relatively  tight.  Third,  it  is  noted  that  the  simulation  results  and  the  model 
predictions  are  almost  always  lumped  together. 

Similar  phenomena  are  observed  in  the  case  of  Enron  email  dataset  (Figure  3). 

5.  DEEPER  INSIGHTS 

In  this  section,  we  draw  deeper  insights  related  to  the  following  three  questions  when 
the  spreading  is  stable. 

— Does  deg(  v )  govern  i*  =  limy  >00  iv(t),  namely  what  role  do  node  degrees  play  in  terms 
of  determining  the  nodes’  infection  rates  (Section  5.1)? 

— Under  what  conditions  the  mean  field  approach  is  reasonably  accurate  and  thus 
applicable  to  arbitrary  networks  (Section  5.2)? 

— When  the  mean  field  approach  is  reasonably  accurate,  how  can  we  exploit  it  to  infer 
the  global  mean  infection  rate  through  localized  monitoring  (Section  5.3)?  This  is 
especially  important  because  it  does  not  require  knowledge  of  the  parameters. 


5.1.  The  Role  of  Node  Degree  on  Node  Infection  Rate 

Both  the  upper  and  lower  bounds  given  in  Theorem  4.4  are  a  function  of  deg(u).  This 
suggests  as  a  special  case  that  i*,  the  probability  that  node  v  is  infected  when  the 
spreading  becomes  stable,  may  be  dependent  upon  deg(c).  To  characterize  the  impact 
of  deg(i>)  on  i*,  we  start  with  an  observation.  Since  we  want  to  reveal  the  relation 
between  i*  and  deg(r),  we  treat  it  as  if  i*  depends  only  on  deg(r).  This  allows  us  to 
consider  the  probability  any  degree-^  node  is  infected  after  the  spreading  becomes 
stable.  Suppose,  for  feasibility,  all  of  v’s  neighboring  nodes  have  identical  infection  rate 
as  v,  namely  that  i*  ~  i*  for  all  u  with  ( u ,  v)  e  E.  From  master  equation  (3.2),  we  have 


= 


a-i*v)+a-  p)i*v 


i  —  (i  — «)  Y\  (i-K) 

(u,v)eE 

~  [1  -  (1  -  a)(l  -  yi*)k\ Id  -  i*v)  +  (1  -  p)i*. 
Hence,  solution  to  the  following  equation 

pi* (k)  =  [1  —  (1  —  «)(1  -  yi*mk)(l  -  i*m 


(5.1) 


(5.2) 
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is  the  probability  that  any  degree-^  node  is  infected.  In  other  words,  the  solution  to 
Eq.  (5.3)  approximates  i*(k).  We  have 

fix  —  [1  —  (1  —  a)(  1  —  yx)*](l  —  x)  (5.3) 

for  each  k  —  1,2 . max„€y{deg(iO}.  We  observe  that  Eq.  (5.3)  can  be  numerically 

solved  using  (for  example)  Matlab. 

In  order  to  evaluate  the  preceding  approximation-based  analytic  heuristics,  we  con¬ 
ducted  a  simulation  study.  In  order  to  attest  the  applicability  of  the  heuristics,  we 
further  used,  in  addition  to  the  aforementioned  Epinions  and  Enron  datasets,  the  fol¬ 
lowing  datasets  which  vary  in  network  topology  and/or  size. 

— Oregon  autonomous  system  dataset.  This  is  an  undirected  graph  of  real-life  network 
connections  between  Internet  autonomous  systems.  The  dataset  is  obtained  from 
Web  site  http://topology.eecs.umich.edu/data.html.  The  graph  has  11,461  nodes  and 
32,730  edges  with  average  node  degree  5.7115,  maximal  node  degree  2,432,  and 
largest  eigenvalue  Ii  a  =  75.2407.  It  also  exhibits  a  power-law  degree  distribution. 
— Power-law  graphs.  We  used  the  Brite  tool  [Medina  et  al.  2001]  to  generate  two 
synthetic  undirected  power-law  graphs.  The  small  one  has  2,000  nodes  and  5,997 
edges  with  average  node  degree  5.997,  maximal  node  degree  90,  and  M >A  =  11.6514. 
The  large  one  has  37,000  nodes  and  369,908  edges  with  average  node  degree  19.995, 
maximal  node  degree  1,530,  and  M  =  103.9362. 

— Random  graphs.  We  used  two  synthetic  undirected  Erdos-Renyi  graphs.  The  small 
one  has  2,000  nodes  and  6,001  edges  with  average  node  degree  6.001,  maximal  node 
degree  16,  and  M,a  =  7.1424.  The  large  one  has  37,000  nodes  and  370,000  edges  with 
average  node  degree  20,  maximal  node  degree  40,  and  Ai  =  21.0894. 

— Regular  graphs.  The  small  synthetic  undirected  regular  graph  has  2,000  nodes  and 
6,000  edges  with  node  degree  6  and  M.a  =  6.  The  large  synthetic  undirected  regular 
graph  has  37,000  nodes  and  370,000  edges  with  node  degree  20  and  Ai  =  20. 

In  our  simulations,  we  set  a  —  0.4,  /S  =  0.6  and  y  —  0.004,  which  always  satisfy  the 
sufficient  condition  given  in  Theorem  4.3  and  thus  the  spreading  will  become  stable. 
We  compare  in  Figures  4  and  5  the  i*(k)  obtained  from  simulation  and  the  i*(k)  obtained 
by  solving  Eq.  (5.3).  In  each  case,  the  x-axis  represents  degree  k  =  1, . . . ,  max{deg(r)}, 
and  the  y-axis  represents  simulation-based  i*(k)  (+  in  red  color),  which  is  obtained  by 
averaging  (over  100  simulation  runs)  the  i*’s  of  the  degree-^  nodes  after  the  spreading 
becomes  stable,  and  the  i*(k )  obtained  by  numerically  solving  Eq.  (5.3)  (x  in  black 
color). 

In  the  case  of  both  real  and  synthetic  power-law  graphs  (refer  to  Figure  4)  the 
numerical  results  fit  the  simulation  results  pretty  well,  although  they’re  not  perfect. 
This  means  that  the  approximation-based  Eq.  (5.3)  could  be  used  in  practice.  On  the 
other  hand,  we  observe  some  intriguing  phenomena. 

— In  the  large  real  or  synthetic  power-law  graphs  (Figures  4( a) — 4(d)),  node  degree  k 
has  a  nonlinear  factor  on  the  nodes’  infection  rate  i*(k )  whereas,  for  the  small  one 
(Figure  4(e)),  node  degree  has  a  linear  factor  on  i*(k).  How  can  we  interpret  this 
discrepancy?  We  observe  that  the  nonlinearity  appears  to  be  proportional  to  the 
range  of  the  node  degrees.  For  example,  the  nonlinearity  in  Figure  4(d)  appears  to 
be  in  between,  and  so  is  its  node  degree  interval.  This  phenomenon  arguably  can  be 
explained  as  that  a  nonlinear  function  could  seem  linear  in  a  sufficiently  small  local 
region  (e.g.,  degree  range  [1,  250]  in  Figures  4(a)— 4(d)). 

— We  observe  that  for  sufficient  large  degree  k  (e.g.,  when  k  >  1,  000),  i*{k)  becomes 
almost  flat.  This  can  be  explained  as  follows.  We  observe  that  the  asymptotic  limit 
lim^oo  i*(k)  —  1/(1  +  /!],  which  means  that  the  variation  of  the  slopes  of  the  nonlinear 
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Fig.  4.  On  the  accuracy  of  the  approximation:  Real  and  synthetic  power-law  graphs. 
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Fig.  5. 


On  the  accuracy  of  the  approximation:  Synthetic  random  graphs. 


curves  vanishes  rapidly  and  thus  the  curves  look  flat  because  (for  example)  the  limit 
value  is  0.635  for  =  0.6. 

In  the  case  of  synthetic  Erdos-Renyi  random  graphs  (Figure  5),  the  numerical  results 
and  the  simulation  results  agree  to  a  large  extent.  We  observe  that  node  degree  k  has 
a  linear  factor  on  i*(k).  Similar  to  the  case  of  power-law  graphs,  this  might  be  caused 
by  the  fact  that  the  degrees  are  densely  concentrated  on  a  small  interval,  in  which  a 
nonlinear  function  could  be  approximated  as  a  linear  one. 

In  the  case  of  synthetic  regular  graphs,  we  report  that  the  numerical  results  and 
the  simulation  results  match  nicely.  We  did  not  plot  the  figures  because  there  is  only  a 
single  point  in  each  figure,  which  does  not  offer  any  extra  value. 

In  summary,  we  draw  the  following. 

Insight  1.  Node  degree  plays  an  essential  role  in  governing  node  infection  rate. 
Moreover,  Eq.  (5.3)  is  a  reasonably  good  approximation  of  which  means  that  one 
could  use  Eq.  (5.3)  to  estimate  i*(k)  for  given  parameters. 

5.2.  Condition  under  which  the  Mean  Field  Approach  Is  Accurate 

In  this  section,  we  will  provide  further  analyses  of  the  push-  and  pull-based  epidemic 
models  using  the  well-known  mean  field  approach,  which  is  a  statistical  physics  method 
introduced  for  the  purpose  of  studying  stochastic  particle  systems  [Chandler  1987]. 
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Roughly  speaking,  the  mean  field  approach  is  to  use  the  mean  (expectation)  to  stand 
for  stochastic  interactions.  Mathematically  speaking,  it  can  be  seen  as  the  first-order 
approximation  of  random  variables.  To  be  concrete,  we  here  give  a  simple  example. 
Suppose  that  a  variable  w  is  influenced  by  its  random  environment,  which  is  described 
by  a  variable  x,  via  a  scheme  h.  That  is,  w  =  hix).  By  Taylor  expansion,  we  have 

w  —  hix )  =  h({x))  +  h'((x))(x  —  (x))  +  o(|x  —  (x}|), 

where  (•)  stands  for  the  expectation  (mean).  By  omitting  the  higher-order  terms  and 
conducting  expectation  on  both  sides,  we  have  an  approximation  ( w )  —  hi  (x) ).  This 
highlights  the  basic  idea  of  the  mean  field  approach:  replacing  a  random  field  with  its 
mean  field. 

The  mean  field  approach  also  has  become  an  important  method  for  understanding 
complex  dynamics.  In  general,  this  method  is,  however,  not  applicable  in  our  setting 
because  of  the  arbitrary  network  heterogeneity  we  accommodate.  Therefore,  it  is  nat¬ 
ural  to  ask  for  conditions  under  which  the  method  is  applicable.  In  what  follows  we 
address  this  question. 

We  first  derive  a  formula  to  describe  i,  the  global  mean  infection  rate  defined  after 
the  spreading  becomes  stable,  namely, 

1  1  veV  11  veV 

in  Erdos-Renyi  random  graph  with  edge  probability  p.  Recall  the  equation  for  the  fixed 
point  of  master  equation  (3.2) 


So,  we  have 


#;  = 


i — (i — a)  n  (i  -Yi*u) 

( u,v)eE 


(i-O- 


l  —  (i  —  a)  Y\  d-KP 

(u,v)eE 


We  use  the  following  approximation  to  derive  the  right-hand  side  term. 


1  —  (1  —  a)  Y\  d-K) 

(u,v)eE 

In  Erdos-Renyi  random  graphs  with  edge  probability  p,  every  pair  of  nodes  are  linked 
with  probability  p.  So,  we  can  approximate  the  following  term  as 

n  (i-K)W-m)n 

( u,v)eE  I 

Let  ( k )  represent  the  average  degree.  When  y  (k)  is  sufficiently  small,  we  have 
log[(l  -  ypi)n]  —  7ilog[l  -  ypi]  -npyi  =  -y{k)i. 

Thus,  we  have 

n  (1  ~  ^  exp(-y(k)i). 

( u,v)eE  I 


(i-C) 


(l-a)l  [~[  (1  -yi*) 

\(u,v)eE  I 


(1  -i) 
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(a)  Epinions  dataset 


(b)  Enron  dataset 


(c)  Oregon  dataset 


(d)  Power-law  graph  (small) 


(e)  Random  graph  (small)  (f)  Regular  graph  (small) 


Fig.  6.  Mean-field  approximation  via  Eq.  (5.4)  (the  curve)  vs.  simulation  (the  dots):  f$  =  0.4  and  y  =  0.004. 
Observe  that  i  increases  with  respect  to  a  in  a  concave  fashion. 


So,  we  have 

pi  [1  —  (1  —  a)exp(— (k)yi)\(l  —  i).  (5.4) 

Its  solution  is  i(a,  ft,  y,  (k)).  Although  we  are  unable  to  derive  its  analytic  expression, 
we  can  calculate  the  solution  numerically. 

The  previous  discussion  was  made  for  Erdos-Renyi  random  graphs  with  edge  proba¬ 
bility  p  with  small  y(k}.  Can  it  be  applied  to  other  network  topologies,  perhaps  under 
the  condition  y  (k)  is  small?  To  answer  this  question,  we  conduct  simulation  using  some 
of  the  aforementioned  datasets,  which  are  selected  with  a  certain  balance  on  network 
size  and  topology  (three  large  real  datasets,  three  small  synthetic  datasets).  Simulation 
results  are  based  on  the  average  of  50  runs. 

Figure  6  compares  the  i  derived  from  the  simulation  results  after  the  spreading 
becomes  stable  (the  dots)  and  the  numerical  solution  i  of  Eq.  (5.4)  with  fixed  /i  =  0.4 
and  y  =  0.004.  In  the  case  of  the  small  synthetic  Erdos-Renyi  random  graph,  for  which 
Eq.  (5.4)  was  derived,  we  observe  that  the  simulation  result  and  the  numerical  result 
match  perfectly.  Moreover,  for  the  other  network  graphs,  they  also  match  nicely.  In  all 
cases,  the  global  mean  infection  rate  i  increases  with  respect  to  a  in  a  concave  fashion. 
This  means  that  in  order  to  significantly  lower  the  global  mean  infection  rate,  a  must 
be  decreased  substantially  (e.g.,  from  a  =  0.6  down  to  a  =  0.2).  In  other  words,  security 
against  pull-based  infection  such  as  “drive-by  download”  attacks  must  be  substantially 
enhanced;  otherwise,  the  spreading  will  remain  at  a  high  level. 

Figure  7  compares  the  i  derived  from  the  simulation  results  after  the  spreading 
becomes  stable  (the  dots)  and  the  numerical  solution  i  to  Eq.  (5.4)  with  fixed  a  =  0.1 
and  y  =  0.004.  In  the  case  of  Erdos-Renyi  random  graph,  the  simulation  result  and  the 
numerical  result  match  perfectly.  For  other  network  graphs,  they  also  match  well.  In  all 
cases,  the  global  mean  infection  rate  i  decreases  with  respect  to  f>  in  a  convex  fashion. 
This  means  that  in  order  to  lower  the  global  mean  infection  rate  significantly,  ft  must  be 
increased  substantially  (e.g.,  from  ft  =  0.1  to  fS  =  0.5).  In  other  words,  reactive  security 
mechanisms  against  push-  and  pull-based  infections  must  be  substantially  enhanced; 
otherwise,  the  spreading  will  remain  at  a  relatively  high  level. 
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Fig.  7.  Mean-field  approximation  via  Eq.  (5.4)  (the  curve)  vs.  simulation  (the  dots):  a  =  0.1  and  y  =  0.004. 
Observe  that  i  decreases  with  respect  to  /3  in  a  convex  fashion. 


Figure  8  compares  the  i  derived  from  the  simulation  results  after  the  spreading 
becomes  stable  (the  dots)  and  the  numerical  solution  i  to  Eq.  (5.4)  with  fixed  a  =  0.1  and 
P  =  0.4.  In  the  case  of  random  graph,  the  simulation  result  agrees  with  the  numerical 
result.  This  also  applies  to  the  case  of  regular  graph  topology.  In  both  cases,  the  global 
mean  infection  rate  i  increases  with  respect  to  y  in  a  somewhat  linear  fashion.  However, 
for  power-law  topologies,  the  simulation  result  matches  the  numerical  solution  only 
when  y  (therefore,  y{k))  is  small;  this  confirms  the  condition  under  which  the  mean 
field  approach  is  applicable  to  arbitrary  networks.  In  any  case,  we  observe  that  any 
enhancement  in  security  mechanisms  against  spreading  (e.g.,  blocking  suspicious  or 
malicious  sessions)  will  always  have  an  almost  linear  impact  on  the  mean  infection  rate. 
This  is  quite  different  from  the  impact  of  a  and  p  because  it  states  that  network-based 
defense  is  always  effective. 

The  preceding  discussions  lead  us  to  draw  the  following. 

Insight  2.  The  mean  field  approach  is  useful  in  the  setting  of  arbitrarily  hetero¬ 
geneous  networks  when  y{k)  is  small.  Under  this  circumstance,  we  found  that  the 
impacts  of  the  parameters  a,  ft,  and  y  are  very  different  (as  described  before). 

5.3.  Peeking  into  Global  Infection  via  Localized  Monitoring 

Estimating  i  via  localized  monitoring  with  unknown  parameters.  In  the  preceding  we 
showed  that  when  y(k)  is  small,  the  solution  to  Eq.  (5.4)  is  a  good  approximation  of 
the  global  mean  infection  rate  i,  Still,  it  requires  to  know  the  values  of  parameters  a, 
P,  and  y.  This  naturally  leads  to  another  question:  What  can  we  say  or  (how)  can  we 
obtain/approximate  i  even  if  the  parameters  are  not  known?  In  theory,  we  can  derive  i 
by  monitoring  the  entire  network  so  as  to  count  the  number  of  infected  nodes  at  every 
time  t.  This  is  clearly  infeasible  in  practice.  It  then  becomes  natural  to  ask:  How  can 
we  sample  the  network  so  as  to  approximate  i  with  a  reasonable  precision?  Because 
the  states  of  the  nodes  are  not  independent  of  each  other,  rather  they  are  correlated 
in  a  certain  delicate  fashion  as  governed  by  the  master  equation  (3.2),  a  reasonable 
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(a)  Epinions  dataset 


(b)  Enron  dataset 


(c)  Oregon  dataset 


(d)  Power-law  graph  (small) 


(f)  Regular  graph  (small) 


Fig.  8.  Mean-field  approximation  via  Eq.  (5.4)  (the  curve)  vs.  simulation  results  (the  dots):  a  =  0.1  and 
ft  =  0.4.  Observe  that  i  increases  with  respect  to  y  in  a  somewhat  linear  fashion. 


sampling  strategy  must  take  into  consideration  the  evolution  of  the  dynamics.  In  what 
follows  we  present  such  a  result.  To  our  surprise,  the  sample  size  can  be  constant  and 
independent  of  the  size  of  the  network  in  question. 

Recall  that  i*  is  the  infection  rate  of  node  v  after  the  spreading  becomes  stable  and 
i*ik)  is  the  mean  infection  rate  of  the  nodes  with  degree  k.  Then,  we  have 


1 

|{n  :  deg(r)  =  k}\ 


X!  C  =  (i* I  deg(r)  =  £>„ 
deg(i>)=& 


where  (£(w)|r>u  denotes  the  conditional  expectation  of  random  variable  £(u)  over  u  e  V 
under  condition  T  (the  subscript  u  can  be  neglected  if  it  is  clear  from  the  context).  Recall 
Eq.  (5.2),  namely  the  approximation 

pi*  m  *  [i  -  (i  -  «xi  -  Yi*mk  Ki  -  i*m. 

Thus,  the  global  mean  infection  rate  equals  to  the  mean  of  i*(k)  over  the  degree  distri¬ 
bution.  We  have 


i  =  (i*v)v  = 

due  to  the  property  of  conditional  expectation  that  (£)  =  ((£|  r]))  for  two  arbitrary 
random  variables  £  and  ij.  With  linear  approximation,  we  have 

di*(k) 

i*(k)  —  i*({k))  H - ——  (k  -  ( k))  +  o{k  -  ( k )), 

dk 

which  leads  to 

i  =  (i*(k))k  =  i*m )  +  -  (A))J  +  [oik  -  (k)\  . 
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(a)  Epinions  dataset 


(b)  Enron  dataset 


(c)  Oregon  dataset 
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Fig.  9.  i(t)  vs.  iavg(f). 


Neglecting  the  higher-order  terms,  assuming 


[di*(k ) 
\  dk 


(k-(k))} 

Ik 


idi*m 

\  dk  Jk 


(Ik  -  {k)))k. 


and  because  ( k  —  {k))k  —  0,  we  have  the  following  approximation 

i  ~  i*((k)). 


This  leads  us  to  draw  the  following. 

Insight  3.  We  can  measure  (or  approximate)  i,  the  global  mean  infection  rate  in  a 
network,  through  the  mean  infection  rate  of  the  nodes  with  the  average  node  degree  of 
a  network. 


To  confirm  the  previous  insight,  we  use  simulation  to  compare  the  global  mean 
infection  rate  i  and  the  mean  infection  rate  of  the  average-degree  nodes,  denoted  by 
iavg-  We  plot  the  results  in  Figure  9.  The  parameters  used  for  simulation  are  a  —  0.4, 
f>  =  0.4,  and  y  =  0.001;  in  each  of  these  cases,  the  spreading  is  stable. 

Figure  9  does  show  that  they  are  reasonably  close  to  each  other  at  the  resolution  of 
10~2.  The  preceding  insight  is  useful  not  only  because  the  number  of  nodes  with  the 
average  degree  is  much  smaller  than  the  total  number  of  nodes,  but  also  because  we  do 
not  need  to  know  the  parameters  a,  f>,  and  y.  In  other  words,  it  states  that  monitoring 
the  nodes  with  the  average  degree  is  sufficient  to  measure  or  approximate  the  global 
state  of  the  network;  this  explains  why  we  call  it  localized  monitoring. 

Estimating  i  via  localized  monitoring  with  further  reduced  sample  size.  While  the 
previous  insight  is  already  very  useful,  we  ask  a  further  deeper  question:  What  if  the 
number  of  nodes  with  the  average  degree  is  still  large  (e.g.,  thousands)?  For  example,  in 
the  case  of  the  Epinions  dataset,  there  are  1,070  nodes  (that  is,  1.41%)  whose  degrees 
equal  to  the  average  degree;  in  the  case  of  Enron  dataset,  there  are  2,212  nodes  (that  is, 
6.03%)  whose  degrees  equal  to  the  average  degree.  Monitoring  and  examining  whether 
those  nodes  are  compromised  at  every  time  t  is  still  costly.  Is  it  possible  to  monitor  even 
smaller  number  of  nodes  (ideally,  constant  number  of  nodes  independent  of  the  size  of 
the  network)?  In  what  follows  we  answer  this  question  affirmatively. 
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For  the  purpose  of  further  reducing  the  number  of  nodes  to  monitor,  we  need  to  find 
further  approximation  to  i*((k)).  From  the  equations  of  the  equilibrium. 

_K_  =  1  -  (1  -  «)  Y\  d  -  Yi*u),  v  e  V ,  (5.5) 

lv  (u.v)eE 

we  know  that  the  infection  rate  of  node  v  depends  on  those  of  its  neighboring  nodes.  For 
the  same  reason,  the  infection  rate  of  node  u,  which  is  a  neighbor  of  node  v,  depends  on 
the  state  of  u’s  neighboring  nodes.  Thus,  we  can  regard  i*  as  a  function  of  v’s  second- 
order  neighborhood,  which  consists  of  nodes  r  such  that  r  ^  v  A  (r,  u)  e  E  A  (u,  v)  e  E. 
More  precisely,  we  define  the  second-order  degree  of  node  v,  denoted  by  2deg(i>),  as 

2deg(i>)  =  ^  deg(w)  —  \{w  :  (w,  v)  e  E  a  (i>,  w )  e  E}\. 

u:(u,v)eE 

This  means  that  when  two  different  neighbors  of  node  v  have  a  common  neighbor  r 
besides  v  itself,  r’s  contribution  to  the  infection  rate  of  v  should  be  counted  twice.  This 
leads  us  to  consider  the  first-  and  second-order  neighborhood  of  the  nodes  with  the 
average  degree  {k).  With  the  mean  field  approach,  we  have  the  following  approximation. 

1  -  it 

and 

( u ,  v)  e  E 

«)(1  -  X(i;>.:^,A(r,«)e£A(U,„)e£)<deg(“)l(“'U)e£> 

As  a  result,  we  can  further  approximately  regard  i*  with  deg(r)  =  (k)  as  a  random 
variable  over 


P{i*\(u,  v )  €  E) 

1  -  v )  e  E) 


K 


1~l'u 

-(1- 


2deg(v)  =  [(deg(w)|(w,  v)  e  E)u  -  1]  deg(i>). 

Assuming  (i*\(r,  u)  e  E ,  ( u ,  v)  e  E,  deg(r)  =  (k))v  =  t*  and  letting  iv  —  (i*|(w,  v)  e  E),  we 
have 

and  =  1  —  (1  —  ot)(l  —  yi*)k'\ 

1  ~iv 

2deg(n) 

k 

Given  the  mean  degree  (k),  we  can  regard  i*  as  a  function  of  random  variable  k'v,  that 
is,  i*  =  f(k')  for  all  v  with  deg(i>)  =  k  and  for  some  function  /(•)■  By  a  discussion 
similar  to  before,  we  have  an  approximation  i  ^  f((k'))  =  f({ 2deg(r))„:deg(u)=*/(deg(r))). 
This  means  that  we  can  select  the  nodes  whose  degrees  equal  to  ( k )  and  second-order 
degrees  equal  to  (2deg(r)|deg(v)  =  (k)).  This  leads  us  to  draw  the  following  insight, 
which  allows  to  further  reduce  the  sample  size. 

Insight  4.  We  can  approximate  i ,  the  global  mean  infection  rate,  by  only  monitoring 
the  nodes  whose  degrees  equal  to  the  average  degree  ( k )  and  second-order  degrees  equal 
to  (2deg(r)|  deg(r)  =  (k)). 

The  preceding  insight  statistically  states  that  if  there  are  no  or  few  nodes  whose 
degrees  and  second-order  degrees  satisfy  the  requirements,  then  we  can  select  the 


K 


i  -  a 


=  1  —  (1  —  «)(1  -Y~iv)(k) 


where 
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(a)  Epinions  dataset  (b)  Enron  dataset  (c)  Oregon  dataset 


(d)  Power-law  graph  (small)  (e)  Random  graph  (small)  (f)  Regular  graph  (small) 


Fig.  10.  t2avg-x[10, 190]  -  *[10, 190]. 


nodes  whose  degrees  and  second-order  degrees  are  close  to  the  respective  average 
degrees.  Because  the  insights  do  not  specify  how  large  the  reduced  sample  should  be, 
in  what  follows  we  use  simulation  to  demonstrate  how  large  the  samples  need  be. 

Let  2avg-x  denote  the  set  of  x  nodes  whose  degrees  equal  to  the  average  degree  and 
second-order  degrees  are  the  closest  to  (2deg(i>)|  deg(v)  =  (k)).  We  note  that  there  are 
two  ways  to  approximate  i  via  i2avg-x- 

—We  approximate  ~i[t0,ti\  =  as  Wxto.fd  =  s=S+I  DtstoA] 

J2we2ava-x  i-wit).  This  approximation  is  meaningful  because  the  former  captures  the 
global  mean  infection  rate  averaged  over  time  interval  [to,  til  and  the  latter  captures 
the  mean  infection  rate  of  the  x  sampled  nodes  over  the  same  period  of  time.  Note 
that  to  >  0  is  used  because  we  deal  with  the  situation  after  the  spreading  becomes 
stable;  our  simulation  result  indicates  that  to  can  be  no  greater  than  10. 

—We  approximate  i(t)  via  i2avg-x[to,  t]  =  r^rTT  L/fc[4„/.|  Ewe2avg-x  twit).  This  approxima¬ 
tion  is  useful  and  interesting  because  the  latter  captures  the  “history”  of  the  small 
samples,  which  can  actually  approximate  a  certain  index  of  the  whole  network. 

Corresponding  to  the  first  type  of  approximation  mentioned  earlier,  Figure  10  plots 
i2avg-x[to,  h]  —  i [to,  ti]  over  time  interval  [to,  h]  =  [10, 190]  (with  step-length  10)  based 
on  our  simulation  study,  where  x  e  [2,  4,  8, 16,  32,  64, 128,  256}  nodes  are  selected  as 
indicated  before.  We  observe  that  in  all  the  plotted  network  graphs,  16  nodes  are  enough 
to  form  a  sufficiently  accurate  sample.  What  is  surprising  is  that  it  seems  a  constant 
size  of  samples  is  sufficient,  regardless  of  the  size  of  the  graph.  It  is  an  exciting  future 
work  to  analytically  explain  this  phenomenon. 

Corresponding  to  the  second  type  of  approximation  mentioned  earlier,  Figure  11 
plots  i2avg-xho,  t ]  versus.  i(t),  where  to  =  10 ,t  =  20,  30, . . . ,  200,  and  x  e  [2,  8,  32}  nodes 
are  selected  as  indicated  previously.  Note  that  the  accuracy  shown  in  Figure  11  is 
not  as  good  as  the  accuracy  shown  in  Figure  10.  This  is  because  Figure  10  corre¬ 
sponds  to  the  average  over  time  interval  [10, 190],  where  the  comparison  is  between 
z [10, 190]  and  i2avg-x[10, 190],  whereas  Figure  11  corresponds  to  i(t)  and  i2avg-x[10,  t ]  for 
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t  =  10,  20, . . . ,  190.  Nevertheless,  as  shown  in  Figure  11,  for  moderate  t  (e.g.,  t  =  120), 
the  prediction  of  the  future  is  with  reasonable  accuracy.  This  leads  to  the  following. 

Insight  5.  The  accumulative  observations  obtained  from  monitoring  certain  con¬ 
stant  number  of  nodes  (selected  as  mentioned  before,  for  example,  monitoring  32  nodes 
as  shown  in  Figure  11)  allow  to  estimate  the  current  global  mean  infection  rate  i(t) 
with  a  pretty  good  accuracy. 


6.  CONCLUSION  AND  FUTURE  WORK 

We  rigorously  investigated  a  push-  and  pull-based  epidemic  spreading  model  in  arbi¬ 
trary  networks.  We  presented  sufficient  conditions  or  epidemic  thresholds  under  which 
the  spreading  will  become  stable  (but  not  dying  out).  The  conditions  supersede  the 
relevant  ones  offered  in  the  literature.  Our  investigation  leads  to  further  insights  re¬ 
garding  the  characterization  of  the  role  of  node  degree  in  governing  node  infection  rate, 
the  characterization  of  a  condition  under  which  the  mean  field  approach  is  applicable 
in  arbitrary  networks,  and  a  sampling  strategy  for  selecting  a  small  number  of  nodes 
to  monitor  so  as  to  estimate  the  global  mean  infection  rate  without  knowing  the  values 
of  the  parameters. 

There  are  many  exciting  research  problems  from  both  theoretic  and  practical  points 
of  view.  In  addition  to  those  mentioned  in  the  text,  here  we  highlight  the  following.  First, 
how  can  we  precisely  pin  down  the  equilibrium  states  in  the  general  case  of  a  >  0? 
Second,  can  we  have  tighter  bounds  especially  when  the  spreading  is  not  stable?  Third, 
can  we  statistically  prove  that  the  sample  size  can  be  constant  and  thus  independent 
of  the  size  of  the  network  (as  indicated  by  our  empirical  study)?  Fourth,  can  we  design 
sampling  algorithms  to  sample  the  nodes  without  knowing  the  network  topology  (our 
characterization  demonstrates  their  existence  but  requires  the  knowledge  of  the  global 
network)? 
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Theoretical  modeling  of  computer  virus/worm  epidemic  dynamics  is  an  important  problem  that 
has  attracted  many  studies.  However,  most  existing  models  are  adapted  from  biological  epi¬ 
demic  ones.  Although  biological  epidemic  models  can  certainly  be  adapted  to  capture  some  com¬ 
puter  virus  spreading  scenarios  (especially  when  the  so-called  homogeneity  assumption  holds), 
the  problem  of  computer  virus  spreading  is  not  well  understood  because  it  has  many  important 
perspectives  that  are  not  necessarily  accommodated  in  the  biological  epidemic  models.  In  this 
paper  we  initiate  the  study  of  such  a  perspective,  namely  that  of  adaptive  defense  against  epidemic 
spreading  in  arbitrary  networks.  More  specifically,  we  investigate  a  non- homogeneous  Susceptible- 
Infectious-Susceptible  (SIS)  model  where  the  model  parameters  may  vary  with  respect  to  time. 
In  particular,  we  focus  on  two  scenarios  we  call  semi-adaptive  defense  and  fully -adaptive  defense, 
which  accommodate  implicit  and  explicit  dependency  relationships  between  the  model  parame¬ 
ters,  respectively.  In  the  semi-adaptive  defense  scenario,  the  model’s  input  parameters  are  given; 
the  defense  is  semi- adaptive  because  the  adjustment  is  implicitly  dependent  upon  the  outcome  of 
virus  spreading.  For  this  scenario,  we  present  a  set  of  sufficient  conditions  (some  are  more  general 
or  succinct  than  others)  under  which  the  virus  spreading  will  die  out;  such  sufficient  conditions 
are  also  known  as  epidemic  thresholds  in  the  literature.  In  the  fully-adaptive  defense  scenario, 
some  input  parameters  are  not  known  (i.e.,  the  aforementioned  sufficient  conditions  are  not  appli¬ 
cable)  but  the  defender  can  observe  the  outcome  of  virus  spreading.  For  this  scenario,  we  present 
adaptive  control  strategies  under  which  the  virus  spreading  will  die  out  or  will  be  contained  to  a 
desired  level. 

Categories  and  Subject  Descriptors:  K.6.5  [Management  of  Computing  and  Information  Systems]:  Security 
and  Protection 
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Author’s  address:  Shouhuai  Xu,  Li  Xu,  and  Zhenxin  Zhan  are  with  the  Department  of  Computer  Science,  Uni¬ 
versity  of  Texas  at  San  Antonio.  Corresponding  author:  Shouhuai  Xu  (shxu@cs  .  utsa  .  edu). 

Wenlian  Lu  is  with  the  Center  for  Computational  Systems  Biology  and  the  School  of  Mathematical  Sciences, 
Fudan  University  (wen li an @  fudan  .  edu  .  cn). 

Permission  to  make  digital/hard  copy  of  all  or  part  of  this  material  without  fee  for  personal  or  classroom  use 
provided  that  the  copies  are  not  made  or  distributed  for  profit  or  commercial  advantage,  the  ACM  copyright/server 
notice,  the  title  of  the  publication,  and  its  date  appeal',  and  notice  is  given  that  copying  is  by  permission  of  the 
ACM,  Inc.  To  copy  otherwise,  to  republish,  to  post  on  servers,  or  to  redistribute  to  lists  requires  prior  specific 
permission  and/or  a  fee. 

©  2001  ACM  1529-3785/2001/0700-0001  $5.00 


ACM  Transactions  on  Computational  Logic,  Vol.  2,  No.  3,  09  2001,  Pages  1-20. 


2 


Shouhuai  Xu  et  al. 


1.  INTRODUCTION 

Theoretically  modeling  the  spreading  dynamics  of  computer  vims  (or  malware  such  as 
worm  and  hot)  is  important  for  deepening  our  understanding  and  for  designing  effective,  if 
not  optimal,  defenses.  We  observe,  however,  that  the  utility  of  theoretical  modeling  in  this 
context  is  not  well  understood  yet  because  existing  models  are  often  adapted  from  biolog¬ 
ical  epidemic  ones.  As  a  consequence,  many  existing  models  of  computer  virus  spreading 
dynamics  made  the  so-called  homogeneity  assumption,  which  roughly  says  that  the  nodes 
are  equally  powerful  in  infecting  others.  Realizing  the  limitation  of  the  assumption,  there 
have  been  investigations  that  aim  to  weaken  the  assumption  by  considering  heterogeneous 
network  topology  (where  different  nodes  may  have  different  infection  capabilities  because 
they  have  different  degrees).  Along  this  line  of  study,  the  present  paper  moves  a  step  fur¬ 
ther  by  exploring  models  that  accommodate  realistic  scenarios  where  the  model  parameters 
may  change  over  time  (i.e.,  the  parameters  are  some  functions  of  time),  which  captures  the 
fact  that  both  attack  and  defense  are  dynamically  evolving  or  under  dynamical  adjustment 
and  reflects  the  persistence  of  virus  spreading.  This  allows  us  to  investigate  an  impor¬ 
tant  and  novel  perspective  of  virus  spreading-defense  dynamics,  namely  that  of  adaptive 
defense  against  computer  virus  spreading. 

1.1  Our  Contributions 

We  investigate  a  non-homogeneous  Susceptible-Infectious-Susceptible  (SIS)  model  in  ar¬ 
bitrary  networks  (i.e.,  there  is  no  restriction  on  the  topology  of  the  spreading  networks  and 
the  nodes  may  have  different  defense  or  cure  capabilities).  The  model  can  accommodate 
both  semi-adaptive  defense  and  fully-adaptive  defense.  In  the  semi-adaptive  defense  sce¬ 
nario,  the  input  parameters  in  the  model  are  known  and  can  vary  with  respect  to  time  (e.g., 
according  to  some  deterministic  functions  of  time  or  according  to  some  stochastic  process, 
but  we  do  not  impose  any  practical  restrictions  on  the  types  of  functions).  For  this  scenario, 
we  present  a  set  of  sufficient  conditions,  from  general  to  specific  (but  more  succinct),  un¬ 
der  which  the  virus  spreading  will  die  out.  We  note  that  such  sufficient  conditions  are  also 
known  as  epidemic  thresholds  in  the  literature. 

In  the  fully-adaptive  defense  scenario,  some  input  parameters  are  not  known  and  thus  the 
aforementioned  sufficient  conditions  are  not  applicable.  Nevertheless,  the  defender  might 
be  able  to  observe  the  outcome  of  virus  spreading  (i.e.,  which  nodes  are  infected  at  a  point 
in  time).  For  this  scenario,  we  present  adaptive  control  strategies  under  which  the  virus 
spreading  will  die  out  or  will  be  contained  to  a  desired  level  (which  is  important  when,  for 
example,  the  price  to  kill  the  virus  spreading  may  be  too  high). 

Because  of  the  above,  our  model  supersedes  previous  homogeneous  and  non-homogeneous 
models  that  offered  relevant  analytical  insights;  the  concrete  connection  will  be  made  when 
the  need  arises.  Our  analytical  results  are  confirmed  via  simulation,  from  which  we  draw 
additional  observations  that  serve  as  hints  for  future  modeling  studies.  We  discuss  the 
practical  implications  of  our  model  and  the  derived  insights  as  well. 

Finally,  we  note  that  the  present  paper  is  meant  to  explore  theoretical  characterizations 
of  spreading-defense  dynamics  while  assuming  certain  parameters  can  be  observed  or  mea¬ 
sured  (e.g.,  based  on  extensive  data  and  possibly  expert  knowledge).  This  may  not  be  fea¬ 
sible  some  times.  Regardless,  we  believe  that  such  studies  are  important  on  their  own  and 
represent  a  necessary  step  towards  the  ultimate  characterization  of  virus  spreading-defense 
dynamics  (which  in  turn  helps  design  more  effective  or  even  optimal  defenses). 
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1.2  Related  Work 

To  the  best  of  our  knowledge,  there  are  no  existing  studies  on  modeling  adaptive  spreading- 
defense  dynamics  in  arbitrary  networks.  The  work  that  is  most  closely  related  to  ours 
was  due  to  Chakrabarti  et  al.  [Chakrabarti  et  al.  2008],  who  considered  computer  virus 
spreading  in  arbitrary  networks  —  a  scenario  also  investigated  in  [Wang  et  al.  2003;  Ganesh 
et  al.  2005].  The  most  important  contribution  of  these  studies  is  the  identification  of  a 
sufficient  condition  (i.e.,  epidemic  threshold)  under  which  the  virus  spreading  will  die  out; 
we  will  discuss  the  relationship  between  their  result  and  ours  when  the  need  arises.  Earlier 
studies  either  made  the  homogeneity  assumption  (e.g.,  [Kephart  and  White  1991;  1993]) 
as  in  biological  epidemic  models  (see,  for  example,  [McKendrick  1926;  Kermack  and 
McKendrick  1927;  Bailey  1975;  Anderson  and  May  1991;  Hethcote  2000]),  or  considered 
specific  non-homogeneous  networks  [Chakrabarti  et  al.  2008]. 

We  should  mention  prior  work  that  is  conceptually  or  spiritually  relevant.  The  concept  of 
“adaptable  robust  computer  systems”  was  investigated  by  Bhargava  et  al.  [Bhargava  et  al. 
1986],  which  however  has  a  very  different  meaning  and  is  for  very  different  purposes. 
Also  for  a  different  purpose,  Zou  et  al.  [Zou  et  al.  2005]  explored  the  concept  of  “adaptive 
defense”  based  on  cost  optimization,  where  cost  was  introduced  by  false  positives  and  false 
negatives.  In  particular,  they  considered  optimal  adaptive  defense  against  worm  infection, 
but  is  from  the  perspective  of  decision  whether  or  not  to  block/allow  some  specific  host 
traffic.  As  such,  it  may  be  possible  to  combine  their  studies  and  ours  because  we  do  not 
consider  cost. 

Outline:  In  Section  2  we  present  our  model  as  well  as  the  analytical  insights.  We  report 
our  simulation  study  in  Section  3.  We  conclude  the  paper  in  Section  4  with  open  problems. 

2.  ADAPTIVE  EPIDEMIC  DYNAMICS:  MODEL  AND  ANALYSIS 
2.1  The  model 

Primary  parameters.  Because  we  want  to  accommodate  spreading  in  arbitrary  networks, 
we  assume  that  virus  spreads  over  a  series  of  finite,  dynamical  graphs  G(t)  =  ( V. ,  E(t)), 
where  V,  |  V\  =  n,  is  the  set  of  nodes  or  vertices  and  E(t)  is  the  set  of  (possibly  changing) 
edges  or  arcs  at  time  t  >  0  (i.e.,  the  topology  may  change  with  respect  to  time).  At  any  time 
t,  an  infected  node  u  can  directly  infect  node  v  if  (it,  v )  £  E(t).  Denote  by  A(t)  =  [ avu(t )] 
the  adjacency  matrix  of  G(t ),  where  avv(t )  =  0  for  all  v  £  V,  and  avu(t )  =  1  if  and  only 
if  (it,  v)  £  E(t).  Note  that  this  representation  naturally  accommodates  both  directed  and 
undirected  topologies,  and  thus  our  results  equally  apply  to  them. 

A  node  v  £  V  is  susceptible  if  v  is  secure  but  vulnerable,  and  infected  if  v  is  suc¬ 
cessfully  attacked  (i.e.,  infected  and  infectious).  At  any  time  t,  a  node  v  £  V  is  either 
susceptible  or  infected.  Moreover,  a  susceptible  node  may  become  infected  because  of 
some  infected  node  u  where  (it,  v)  £  E(t),  and  an  infected  node  may  become  suscep¬ 
tible  because  of  cure.  Since  an  infected  node  may  become  susceptible  again,  our  model 
falls  into  the  category  of  the  so-called  SIS  models,  but  our  model  has  the  unique  feature 
that  values  of  the  parameters  can  change  with  respect  to  time. 

We  consider  two  dependent  variables:  sv(t),  the  probability  v  £  V  is  susceptible  at 
time  t;  iv(t),  the  probability  v  £  V  is  infected  at  time  t.  We  consider  a  continuous-time 
model,  which  preserves  the  invariant  sv(t)  +  iv{t)  =  1.  The  model’s  input  parameters  are: 

— j3v(t ):  The  probability  an  infected  node  v  becomes  susceptible  at  time  t. 
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— Juv(t)'-  The  probability  an  infected  node  it  successfully  infects  a  susceptible  node  v 
over  edge  (u,  v)  £  E(t)  at  time  t.  For  simplicity,  we  assume  that  7u„(f)  =  7  (t)  for  all 
(it,  v)  £  E(t). 

For  the  sake  of  mathematical  rigorousness,  the  f3v(t)’s  and  the  7u„(f)’s,  which  are  proba¬ 
bilities,  should  be  “measurable”  so  as  to  ensure  the  existence  of  solutions  to  system  (2.4) 
and  be  “bounded”  so  as  to  ensure  the  proof  of  Theorem  2  can  get  through.  To  avoid  any  un¬ 
necessary  mathematical  subtleties,  we  simply  assume  that  these  parameters  are  “boundedly 
measurable,”  which  has  no  consequence  in  practice.  Note  that  A(t)  is  naturally  bounded. 

Other  parameters  and  notations.  Below  is  a  summary  of  the  major  notations  used  in  the 
paper;  notations  only  occasionally  used  are  explained  when  the  need  arises. 


model  input  parameters: 

A(t)  =  [avu(t)] 

Pv{t) 

7  it) 

the  adjacency  matrix  of  graph  G{t)  =  (V,  E(t))  where 
\V\  =  n,  and  avu(t)  =  1  if  and  only  if  (u,  v)  £  E{t). 
Moreover,  avv  ( t )  =  0  for  all  v  £  V . 
the  cure  capability  of  node  v  at  time  t 
the  edge  infection  capability  at  time  t 

dependent  variables: 

Sv(t) 

the  probability  node  v  £  V  is  susceptible  at  time  t 

iy(t) 

the  probability  node  v  £  V  is  infected  at  time  t 

intermediate  variables: 

5v(t) 

the  probability  susceptible  v  £  V  becomes  infected 
at  time  t  because  of  infected  neighbors  {it  :  (it,  v)  £ 
E(t)} 

other  notations: 

Ai 

the  largest  (in  modulus)  eigenvalue  of  adjacency  ma¬ 
trix  A 

HI 

the  1-norm  of  vector  or  matrix 

B(t)  =  diag[/3i(f),  •  •  •  ,/3„(t)] 

the  cure  probability  diagonal  matrix 

The  state  transition  diagram  and  master  equation.  Figure  1  depicts  the  state  transition 
diagram  of  a  node,  where  the  probability  Sv  ( t )  is  given  by 


$v{t)  =  1-  II  [1  -  7(f)  •  .  (2.1) 

(u,v)£E(t) 


Fig.  1 .  State  transition  diagram  of  node  v  £  V  at  time  t 


Note  that  in  the  derivation  of  Eq.  (2.1),  we  assumed  that  the  events  that  infected  neigh¬ 
bors  infect  a  node  are  independent.  Note  also  that  sv  (t)  +  iv  (t)  =  1  for  any  t.  Based  on 
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the  state  transition  diagram  we  obtain  the  following  master  equation  of  dynamics: 


div(t) 

dt 


1  -  II  [1  -  7 (t)avu(t)iu(t)] 
uev 


[1 


iv(t)]  -  0v(t)iv(t), 


(2.2) 


where  avu(t)  =  1  if  and  only  if  (u,  v)  £  E(t). 


2.2  Sufficient  conditions  for  dying  out  in  the  scenario  of  semi-adaptive  defense 

In  this  subsection  we  present  a  set  of  sufficient  conditions  under  which  the  virus  spreading 
will  die  out.  The  sufficient  conditions  are  applicable  when  the  model’s  input  parameters, 
namely  the  f3v(t)’s  and  7 (f)  are  given.  Moreover,  it  is  possible  that  (3v{t)  relies  on  7 (f); 
for  example,  the  former  is  an  (implicit)  function  of  the  latter.  This  explains  why  we  call 
this  scenario  the  semi-adaptive  defense. 


Theorem  1.  (a  general  sufficient  condition  under  which  virus  spreading  dies  out) 
Consider  the  following  comparison  (and  linearization)  system  ofEq.  (2.2): 

avu{t)l{t)xu(t)  -  /3v(t)xv(t).  (2.3) 


Let  x(t) 


[x\(t),  •  •  •  ,  xn(t)]T ,  we  obtain  the  following  compact  form  of  Eq.  (2.3): 


dx(t) 

dt 


ry(t)A(t)  -  B(t) 


x{t). 


(2.4) 


Denote  by  U(t,t')  the  solution  matrix  of  linear  system  (2.4),  namely  that  each  solution 
of  linear  system  (2.4),  x(t),  with  initial  condition  x(t0)  =  Xo,  can  be  written  as  x(t)  = 
U(t,  t0)x0.  Because  the  solution  x(t)  is  dependent  upon  the  initial  value  Xq  but  the  so¬ 
lution  matrix  is  not,  the  corresponding  maximum  Lyapunov  exponent  (MLE)  is 

determined  by  the  solution  matrix  (rather  than  by  the  solution  x(t))  and  can  be  defined  as 
[Oseledec  1968]: 

p  =  lim  7  In  \\U(t,  0) ||| 

t— >00  t 

where  ||  ■  ||  is  (for  specifcality)  the  1-norm  of  matrix  (because  p  is  independent  of  the  choice 
of  the  norm).  If  p  <  0,  the  virus  spreading  will  die  out  regardless  of  the  initial  infection 
configuration1;  if  p  >  0  and  system  (2.4)  is  ergodic  [Oseledec  1968],  the  virus  spreading 
will  not  die  out  in  some  initial  infection  configurations  (i.e.,  “the  equilibrium  of  i*  =  0  is 
unstable”  in  mathematical  terms). 


proof.  Note  that 

1  -  n  t1  -  7 {t)avu{t)iu(t)}  <  7(f)  ]T  ^VU  (t)iu(t).  (2.5) 

uev  uev 

If  i„(0)  =  xv(0)  for  all  v  £  V,  then  the  comparison  system  (2.3)  satisfies  that  iv(t)  < 
xv(t)  holds  for  all  f  >  0  and  v  £  V.  Since  Eq.  (2.3)  is  actually  the  linear  system  of 
the  system  (2.2)  at  iv  =0  for  all  v  £  V.  Therefore,  we  can  conclude  that  the  stability 
of  Eq.  (2.2)  is  equivalent  to  that  of  Eq.  (2.3).  In  other  words,  if  all  xv(t)’s  of  system 
(2.3)  converge  to  zero,  then  system  (2.2)  is  stable  regardless  of  the  initial  values;  on  the 


1  An  initial  infection  configuration  is  specified  by  “which,  and  thus  how  many,  nodes  are  initially  infected". 
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other  hand,  if  system  (2.3)  is  unstable,  then  system  (2.2)  is  also  unstable.  If  system  (2.4) 
is  ergodic  [Oseledec  1968],  then  the  limit  p  exists;  otherwise,  we  can  alternatively  define 
/i  =  limj^oo  j  In  ||  U(t,  0)||,  where  limt_>00^(f)  represents  the  upper  bound  of  the  limit  of 
z(t)  as  t  goes  to  infinity,  is  also  guaranteed  to  exist.  In  any  case,  by  applying  the  definition 
of  MLE,  we  obtain  the  theorem  immediately.  □ 

Discussion.  The  above  sufficient  condition  p  <  0  for  the  virus  spreading  to  die  out  is 
actually  close  to  being  necessary,  meaning  that  if  p  >  0,  then  the  virus  spreading  will 
not  die  out  in  most,  rather  than  just  some,  initial  infection  configurations.  According  to 
the  Lyapunov  exponent  and  smooth  ergodic  theory  developed  by  [Pesin  1977]  and  many 
others,  p  >  0  means  that  the  system  (2.2)  possesses  an  unstable  manifold,  which  implies 
that  the  stable  manifold,  i.e.,  the  set  of  points  (i.e.,  the  initial  values)  starting  from  which 
the  system  (2.2)  converges  to  the  origin,  has  dimension  less  than  n.  Therefore,  the  stable 
manifold  has  Lesbegue  measure  0.  That  is,  except  a  set  with  Lesbegue  measure  0,  the  virus 
spreading  never  dies  out  with  respect  to  any  initial  infection  configuration. 

The  sufficient  condition  given  in  Theorem  1  is  very  general  because  in  its  derivation 
no  “amplification”  is  used  and  dynamical  topology  E(t)  is  accommodated.  However,  it 
requires  to,  among  other  things,  solve  a  system  of  n  linear  equations  of  n  variables  (equiv¬ 
alently,  diagonalizing  a  n  x  n  matrix),  which  can  be  quite  time-consuming  for  large  n  (the 
number  of  nodes).  In  what  follows  we  give  two  succinct  sufficient  conditions,  which  can 
be  easily  connected  to  previous  state-of-the-art  results. 


Theorem  2.  (a  succinct  sufficient  condition)  Suppose  ftv(t)  =  (3 (t)  for  all  v  £  V 
(i.e.,  all  nodes  have  the  same  cure  capability)  and  E(t )  =  E  for  any  time  t,  meaning  that 
the  topology  does  not  change  over  time  and  A  =  A(t)  for  any  t.  Consider  the  system 


dx(t) 

dt 


y{t)A  -  (3v{t)In 


x(t), 


(2.6) 


where  In  is  the  n  x  n  identity  matrix.  Let 

^  rt+to  _  ^ 

7  =  lim  -  /  7 (r)dr,  (3  =  lim  -  /  (3(r)dT, 

*  Jt0  t->°°  t  Jt0 

and  suppose  the  limits  exist  and  are  uniform  with  respect  to  to.  Let  Ai  be  the  largest  (in 
modulus)  eigenvalue  of  A.  If 


Ai  <  (2.7) 

7 

then  the  virus  spreading  will  die  out  regardless  of  the  initial  infection  configuration;  if 

Ai  >  |?  (2.8) 

7 

then  the  virus  spreading  will  not  die  out  in  some  initial  infection  configurations. 
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PROOF.  Consider  system  (2.4).  Let  A  =  S  1 JS  be  the  Jordan  canonical  form  with 


J\ 

A  k 

0 

0  •• 

0 

J2 

1 

A  k 

0  •• 

0 

>h 

,  Jk  = 

0 

1 

0 

Jk 

_0 

0 

...  1 

k 

where  A&,  k  =  1,  •  •  ■  ,  K ,  are  the  distinct  eigenvalues  of  A.  Recall  that  Ai  is  the  largest 
eigenvalue  in  modulus.  From  the  Perron-Frobenius  theorem  [Berman  and  Shaked-Monderer 
2003],  Ai  is  a  real  number.  Then,  letting  y(t)  =  Sx(t),  we  have 

dA  =  [1it)J-min]y{t). 


Namely, 


where  is  the  eigenvalue  of  A  corresponding  to  the  Jordan  block  that  contains  col¬ 
umn  v,  and  £„ (t )  =  0  if  the  u-th  row  of  S  is  an  eigenvector  of  A  and  £„ (f )  =  "/(t)yv-i(t) 
otherwise.  First,  consider  v  =  1  corresponding  to  the  eigenvalues  Ai.  We  have 

Vi(t)  =  3/i (0)  exp  (/  [7(t)Ai  -  /3(r)]dr^  .  (2.9) 


One  can  see  that  the  Lyapunov  exponent  of  system  (2.9)  is  calculated  as 


lim  j  In  |  |j/i(f)|  | 
t— >oo  t 


lim  j  In  |  |j/i(0)||  +  lim  i  f  [7(r)Ai  -  /3(r)]dr 

t— >oo  t  t—> oo  t  J Q 

0  +  7A1  —  p  <  0, 


which  implies  lim^oc  y\{t)  =  0. 

Assuming  lim^oo  yv(t)  =  0  already  proved,  consider  yv+i(t).  We  have 


Uv+i  (t)  =  s/u-j-i  (0)  exp 


+  /  tv+i(a)exp 

Jo 


k(r)dT 


da 


where  k[r)  =  7(r)Afe„+1  —  /3(t).  From  condition  (2.7),  there  exists  a  sufficiently  small 
e  >  0  such  that  (7  +  e)Ai  —  (/3  +  e)  <0.  Let  tp  =  —(7  +  e)Ai  +  0  +  e).  One  can  see 
that  ip  >  0.  Since  the  limits  f*“ +t  P(r)dr  and  ff°+t  7 (r)dr  are  uniform  with  respect  to 
to,  there  exists  T  >  0  such  that 


t  —  a 


P(t )dr  <  /3  +  e, 


t  —  a 


7 (r)dr  <  7  +  e 


hold  for  any  a  and  t  with  t  —  a  >  T.  Let  'JZe(z)  denote  the  real  part  of  a  complex  number 
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z.  Then,  we  have 


Tie 


t  —  a 


f  k(r)di 

J  a 


< 


t  —  a 
1 


[7(r)^e(Afc„+1)  -/3(r)]dr 


[7(t)Ai  -/3(r)]dT 


t  —  a 

<  (7  +  e)Ai  -  (j3  +  e)  =  -ip  <  0 

for  all  t  and  a  with  t  —  a  >  T .  This  implies  that  the  first  term  yv  +i(0)  exp (/0*fc(r)dr) 
converges  to  zero.  Let  M  >  0  be  a  constant  such  that  sup  |7(t)Aj  —  /3(r)|  <  M.  For  the 

T,j 

second  term,  we  have 

J  (a)  exp  ^  J  fc(r)dr^  da 

J  ^u+i(a)exp^y  k(r)dT^jda  +  J  €v+i (a)  exp  (^J  k(r)dT^j  da 


< 


/  |£„+i(a)|  exp  {-p(t  -  a))  da  +  /  |£„+i(a)|  exp(MT)da. 

Jo  Jt-T 


In  the  case  of  £,J+i (t)  =  0,  we  immediately  conclude  that  lim^oo  yv+i(t)  =  0.  Other¬ 
wise,  according  to  the  condition  and  the  L’ Hospital  principle,  we  have 


rt-T 

lim  /  |£„+i(a)|  exp  [  —  <p(t  —  a)l  da  =  lim 

>00  J q  L  J  t—HX 


Jo  Tl^+i(a)|exp(^a)da 

exp(tpi) 

\£v+1(t-T)\exp(tp(t-T)) 


=  lim 

t—*  OO 

Hm  \Vv(t  ~  T)j{t  -  T) |  exp(-yT))  =  Q 


(pexp(ipt) 


due  to  the  assumption  limt-voo  yv(t)  =  0.  We  also  have 


lim 

t — >00 


|£„_i_i(a)|  exp  (MT)da  =  0. 


t-T 


Therefore,  we  can  conclude  lim^^  yv  +i(0  —  0. 

Note  that  condition  (2.8)  implies  that  system  (2.4)  is  unstable.  Since  system  (2.4)  is  in 
fact  the  linearization  system  of  Eq.  (2.2),  we  can  conclude  that  system  (2.2)  is  unstable 
under  condition  (2.8).  This  completes  the  proof.  □ 


If  (3(t )  and  7(f)  are  ergodic  stochastic  processes,  from  the  multiplicative  ergodic  the¬ 
ory  of  the  random  dynamical  systems  [Arnold  1998],  we  have  the  following  result  as  a 
corollary  of  Theorem  2. 


COROLLARY  1.  (another  succinct  sufficient  condition)  Suppose  (3v(t)  =  (3 (t)  for  all 
v  £  V  (i.e.,  all  nodes  have  the  same  cure  capability)  and  E(t)  =  E  for  any  time  t 
(i.e.,  topology  does  not  change  over  time).  Suppose  {/3(f)}t>o  and  {7(i)}t>o  ergodic 
stochastic  processes  (i.e.,  (3(t)  and  7(f)  are  some  random  variables).  Let  E(/3(0))  and 
E(7(0))  be  the  expectations  with  respect  to  the  stationary  distributions  of  the  respective 
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ergodic  stochastic  process.  Suppose  the  convergences 


E(/3(0))  = 


'to 


P(r)dT,  E(7(0))  = 


fto+t 


7(r)dr, 


'to 


are  both  uniform  with  respect  to  to  almost  surely.  If 

x  .  E(/3(0)) 

1  E(7(0)); 

the  spreading  will  die  out  almost  surely  regardless  of  the  initial  infection  configuration;  if 

mo)) 

e(7(o))! 

the  spreading  will  not  die  out  in  some  initial  infection  configurations. 


Discussion.  The  state-of-the-art  sufficient  condition  for  the  dying  out  of  virus  spreading 
in  an  arbitrary  network  is  Ai  <  which  was  given  in  [Chakrabarti  et  al.  2008].  In  the 
setting  of  [Chakrabarti  et  al.  2008],  the  parameters  satisfy  that  (3v[t)  =  f3  for  all  v  £  V 
and  all  t,  and  7(f )  =  7  for  all  t.  As  such,  their  result  is  clearly  a  special  case  of  the  above 
Corollary  1  (note  that  it  is  guaranteed  that  E(7(0))  f  0),  and  thus  of  the  above  Theorem  2. 


2.3  Adaptive  control  in  the  scenario  of  fully-adaptive  scenario 

In  the  semi-adaptive  defense  scenario  investigated  above,  we  assumed  that  the  parameters 
7 (f)  and  (3v(t )  are  given.  What  if  they  are  not  given?  In  what  follows  we  investigate  a 
representative  scenario,  where  the  defender  is  not  given  7 (t)  but  can  observe  iv(t).  Specif¬ 
ically,  we  consider  two  sufficient  conditions  of  adaptive  control:  one  under  which  the  virus 
spreading  will  die  out  (Section  2.3.1),  and  another  under  which  the  virus  spreading  will 
not  die  out  but  will  be  contained  to  a  desired  level  of  infection  (Section  2.3.2). 


2.3.1  Sufficient  condition  under  which  the  virus  spreading  dies  out  under  adaptive 
control.  The  question  we  ask  is:  How  should  the  defender  adjust  the  defense,  namely  how 
(3 v(t)  should  depend  upon  iv(t),  so  that  the  virus  spreading  will  die  out?  We  assume  for 
concreteness  that  /3V( 0)  =  0  for  all  v  £  V;  this  accounts  for  the  worst-case  scenario. 


Theorem  3 .  (characterization  of  adaptive  control  strategy  under  which  the  virus  spread¬ 
ing  will  die  out)  Suppose  without  loss  of  generality  (3V{  0)  =  0  for  all  v  £  V.  If 


dt 


piv  (?) 1 


where  p  is  an  (almost)  arbitrary  positive  constant,  then  the  virus  spreading  will  die  out 
regardless  of  the  initial  infection  configuration. 


PROOF.  Define  a  candidate  Lyapunov  function  with  respect  to  the  infectious  probabili¬ 
ties  *  =  [*i,  •  •  •  ,  in]T  and  the  cure  capabilities  0  =  [/3i,  •  •  •  ,  /3n]T : 

n  1  n 

V{i,(3)  =  ^2iv(t)  +  —'^2{f3v{t)  -  Po,v)2. 

V=1  P V=1 

Let  (3o,v,v  =  1,  •  •  •  ,  n,  be  positive  constants  satisfying 

Po,v  >  {n  -  1)  sup  7(f)  +  1,  V  v  =  1,  ■  •  •  ,n 

t 
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owing  to  the  fact  that  1  >  7 (f)  >  0.  Due  to  inequality  (2.5),  differentiating  V ( i ,  $)  gives 


dt 


=  £ 


1-  n  (l-7(f)iu(f)) 


(1  -  i„(t))  -  y ^pv(t)iv{t) 


+  £($’(0  -  /30,v)iv(t) 

V=1 

n  n  n  n 

&VU  ( t)iu{t )  -^2(3v{t)iv(t)  +  y^(/3„(f)  -  /3o,v)iv(t) 

V=1  U—l  v—1  v—1 

n  n  n 

< 

&VU  ^  ^  fio^yiy  (t) 

d=1  n— 1  11=1 

n  n 

<  7(f)  (n  -  1)  y~X(t)  -  ^Po.yiyjt) 

U—l  V  =  1 

n 

—  ~  y 

V=1 

According  to  the  LaSalle  principle  [LaSalle  1960],  the  system  converges  to  the  largest 
invariant  set  :  J2v=i  =  0},  which  implies  limt^00i^(f)  =  0  for  all  v  = 

1,  ■  •  •  ,  n.  □ 

The  above  Theorem  3  has  the  following  implications. 

PROPOSITION  1 .  We  can  bound  from  above  the  accumulated  number  of  infected  nodes 
in  the  long  run  (a  node  is  counted  multiple  times  if  it  is  infected  at  multiple  points  in  time) 
as  follows: 


(n  -  l)27m  n-1 


>o 


V=1 


y>(0) 

\  v=l 


(n  -  1)372 
2  P 


(2.10) 


where  7 m  =  sup  7(f)  and  f3o  m  =  min/?o 

t  V 

PROOF.  From  the  proof  of  Theorem  3,  we  have 

dV{i,B) 


dt 


—  1)  Pp ,m]  ^  ^  ’tyjl): 


V  =  1 


which  implies 


'0 


1 


Po  ,m  7m  (tl  1) 


[V(0)  -  F(i)]  < 


Po,m  -7 m(ro  -  1) 


V(0). 


If  pv  (0)  =  0  for  all  v  =  1,  ■  ■  •  ,n  and  all  /3„  o  are  the  same  and  are  thus  denoted  by  /3o,  we 
have  the  following  estimation: 


y^i(r)dr  < 


1 


V=1 


Po  -  lm{n  -  1) 


n  1 

[£^(°)  +  Yp (n_  !)^o] - 


■U—l 
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By  setting 

A-“+ - l/(2p)(— 1) - ■ 

we  obtain  the  minimum  of  the  right-hand  side  and  thus  complete  the  proof.  □ 


Physical  meanings  of  Proposition  1.  The  term  at  the  left-hand  side  of  inequality  (2.10) 
captures,  in  addition  to  the  aforementioned  estimation  of  the  total  number  of  infected  nodes 
in  the  network  (counting  repetition)  over  time,  the  convergence  rate  of  the  adaptive  control 
strategy.  This  allows  us  to  draw  the  following  insights:  (i)  The  larger  p,  the  faster  the 
virus  spreading  will  die  out;  (ii)  the  larger  degree  of  initial  infection,  the  slower  the  virus 
spreading  will  die  out;  (iii)  the  larger  edge  infection  probability  7,  the  slower  the  virus 
spreading  will  die  out. 

2.3.2  Adaptive  control  under  which  the  virus  spreading  will  not  die  out  but  will  be 
contained  to  a  desired  level.  In  the  above  we  have  given  some  sufficient  condition  on 
adjusting  the  defense  or  ft  (i)  so  that  the  virus  spreading  will  die  out.  What  if  the  required 
ft(f)  cannot  be  achieved,  meaning  that  we  may  not  expect  that  the  virus  spreading  die 
out?  This  is  possible  because  the  defense  may  not  be  as  good  as  one  may  wish  or  because 
of  budget  limitation.  In  this  case,  we  ask  an  alternative  interesting  question:  What  it  takes 
so  that  iv(t)  can  converge  or  be  contained  to  some  pre-determined  level  of  infection  i*l 


Theorem  4.  (characterization  of  adaptive  control  strategy  under  which  the  virus  spread¬ 
ing  will  be  contained  to  a  desired  level  of  infection)  Consider  the  following  variant  of  the 
master  equation  Eq.  (2.2), 


diy(t) 

dt 


!-  n  [1 — 7  ■  *«(£)] 

(u,v)£E 


(i 


I'V  (f))  T  tVy 


(2.11) 


For  any  1  >  i*  >  0,  v  =  1,  •  •  •  ,  n,  letting 
Pv 


if  we  use  the  following  adaptive  control  strategy 


=  P[*«(0  -  Qivit),  Wy  =  p[i*v  -  iv(t)], 

where  p  is  an  (almost)  arbitrary  positive  constant,  and  p  is  a  positive  constant  with  p  + 
min  (3*  >  1  +  (1  —  7)Ai  where  Ai  is  the  largest  (in  modulus)  eigenvalue  of  the  adjacency 

V 

matrix  A,  we  have  limt_>00  iv{t)  =  i*. 


PROOF.  From  Perron-Frobenius  theorem  [Berman  and  Shaked-Monderer  2003],  we 
have  that  there  exist  some  positive  constants  Pi,  •  •  •  ,  Pn  such  that  P  =  diag[Pi,  •  •  •  ,  If] 
satisfies  [PA  +  AT P]  <  Ai P.  Consider  the  candidate  Lyapunov  function  with  respect  to 

i  =  [ft  ■  ■  •  ,in]T  and  /3  =  [ft,  ■  •  ■  ,ft]T: 

1  n  1  n 

v(ij) ^]2  +  2"  E  p «(&  -  ^)2- 

■») — i  *  -n — i 
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Differentiating  V(x,  (3)  gives 

dV(i,P) 

dt 


=  ^2pv[iv(t) (1  -7  •  i„(f))](l -0v{t)iv(t)  -  T)(iv -i*)j 

V=1  ^  (u,v)eE  ' 


(u,v)£E 
n 

+  'Yhpv{Pv{t)  -  f3*)(iv(t)  -  i*)iv(t) 


V  =  1 


=  ’jrpv[iv{t) -  n  (! -7- *«(*))](! 

v=i  '  ( u,v)eE 

-[!-  n  i1 -'Vi*u)](1-iZ)  -  Pv{t)iv(t)  - /3*il  +  v(iv -i*v) | 


(u,v)€E 
n 

+  EPvVv(t)+(3:)(iv(t)  -i*v)iv{t). 


V—l 


Note  that 


[1  n  (l-7-iu(t))](l-i«(*))-[l-  IT  (l-7-C)](l-C) 


(u,v)£E 

(u,v)GE 

< 

(iv(t)-i*v)  (1-7  iv(t)) 

+ 

a-o[  n  a- 

7  (1-70] 

(u,v)€.E 

(u,v)GE 

(u,v)£E 

< 

iv(t)-i*v\+ 

n  i1-^ 

n  i1  -7O(0i 

(u,v)£.E 

(ui,v)£E,  U\~>U 

(lL2,v)£E,  U2<U 

< 

iv{t)-i*v  |+7  M*)“* 

*  |. 
u  1 

(u,v)£E 

Thus,  we  have 


dV(iJ) 

dt 


<  ^Pv\iv{t)-il\2 +  ^_j'^-iPvavU\ 

V—l  V—l U—l 


ivit)  -i*v\\iu(t)  —  i*v 


n  n 

Pvpv(iv{t)  ~C)2  -r/J^Pv[iv(t)  -  i*v ]2- 

V=1  V=1 

Let  z(t)  =  [z\(t),  ■  ■  ■  ,  zn(t)\T  with  zv(t)  =  \iv(t)  —  i*\.  Since  f3*  >  0,  we  have 

dV ^  ®  <  zT[P  +  j(PA  +  ATP)/2-r]P  -  B*P]z 
<  zT[  1  +  7A1  -  77  —  min  /3*]  Pz  <  0  ,  V  z  ^  0, 


where  B*  =  diag[/3i,  •  •  •  ,  /?„].  Due  to  the  LaSalle  principle  [LaSalle  1960],  we  have  that 
the  system  will  converge  to  the  largest  invariant  set  in  {(*,/?)  :  ^”_i  [iv  —  **]2  =  0}, 

which  implies  that  lim^oo  iv(t)  =  z*  for  all  v  =  1,  •  •  •  ,  n.  □ 
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Discussion.  The  above  theorem  is  quite  general  because  of  the  term  wv  in  Eq.  (2.11).  If 
wv  ^  0,  the  adaptive  control  strategy  must  be  used  with  caution  because  we  must  guarantee 
that  its  value  does  have  physical  meanings.  In  general,  wv  <  0  would  be  reasonable;  in 
our  simulation  study  (Section  3),  we  set  wv  =  0  for  simplicity.  On  the  other  hand,  the 
theorem  is  necessarily  based  on  the  premise  that  7 (f)  =  7,  namely  that  7  (t)  does  not  vary 
with  respect  to  time,  because  of  the  way  f3*  is  defined.  In  our  simulation  study  (Section  3), 
we  will  show  that  the  result  is  quite  robust,  meaning  that  even  if  7  (f)  varies  with  respect 
to  time  (as  we  considered  in  Section  2.2),  the  result  is  still  valid.  This  is  very  important 
because  the  fixed  7  can  be  seen  as,  in  a  sense,  the  average  of  the  unknown  7 (f)  over  time. 

Similar  to  Proposition  1,  we  can  have 

Proposition  2.  We  have 


(2.12) 


V=1 


Physical  meanings  of  Proposition  2.  The  above  proposition  offers  the  following  insights: 
The  larger  p,  the  larger  77,  the  smaller  Ai,  the  smaller  7,  the  smaller  differential  between 
*„(0)  and  i*,  or  the  smaller  the  differential  between  f3v( 0)  and  /3*,  the  faster  the  virus 
spreading  will  die  out. 

3.  SIMULATION  STUDY 

We  conduct  simulation  to  complement  our  analytic  study  for  two  purposes.  First,  we  want 
to  confirm  our  analytical  results  offered  in  Section  2.  Second,  we  want  to  draw  some 
relevant  observations  that  are  not  offered  by  our  analytic  results.  Such  observations  may 
guide  future  studies  of  analytic  models  (e.g.,  how  to  enhance  them  so  that  other  useful 
insights  may  be  obtained  analytically). 

As  mentioned  above,  our  model  is  very  general  because  it  accommodates  dynamical 
graph  topology  G(t)  =  ( V ,  E(t)).  However,  it’s  not  clear  at  this  stage  how  to  appropriately 
define  a  physically  meaningful  way  according  to  which  the  topology  changes.  Therefore 
we  leave  the  full-fledged  characterization  (beyond  what  is  implied  by  our  analytical  results) 
of  the  impact  of  dynamical  topology  to  future  work.  We  conducted  simulations  using  both 
synthetic  (regular,  random,  and  power-law)  graphs  and  a  real  network  graph.  Due  to  space 
limitation,  here  we  report  the  simulation  results  in  the  latter  case  (but  all  the  simulation 
results  are  consistent).  The  real  network  graph  G  =  ( V.  E)  is  based  on  the  Oregon  router 
views  (available  from  http  :  /  /topology  .  eecs  .  umich  .  edu/data  .  html),  where 
\V\  =  11,461  representing  AS  peers,  \E\  =  32,  730  representing  links  between  the  AS 
peers.  The  largest  eigenvalue  of  the  corresponding  adjacency  matrix  is  Ai  =  75.2407. 

3.1  Methodology 

Our  simulation  is  conducted  in  an  event-driven  fashion.  For  the  purpose  of  studying  the 
dynamics  under  our  adaptive  control  strategies,  we  need  to  measure  iv(t),  the  probability 
that  node  v  £  V  is  infected  at  time  t.  This  parameter  can  be  obtained  in  a  tedious  way 
(i.e.,  by  conducting  for  example  100  simulation  runs  in  parallel,  rather  than  in  sequence, 
because  we  need  to  count  the  number  of  times  each  node  v  is  infected  at  each  time  step). 
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A  much  more  simpler  way  however  is  to  use  our  model  formula  Eq.  (2.2)  to  compute 
iv(t)  instead,  as  long  as  the  model  is  accurate.  To  confirm  the  accuracy  of  the  model,  we 
compare  it  with  simulation.  For  simplicity,  we  let  both  /3(f)  and  7 (t)  be  some  periodic 
functions  with  period  T,  which  means  that  both  attack  and  defense  vary  with  respect  to 
time.  We  consider  three  settings:  /3(f)  and  7(f)  being  synchronous,  asynchronous,  or  anti- 
synchronous  because  we  want  to  observe  whether,  and  if  so  to  what  extent,  the  degree  of 
(a)synchrony  has  an  impact  on  the  outcome.  Specifically,  we  consider  two  sets  of  param¬ 
eters:  /3(f)  £  {0.3, 0.5}  and  7(f)  £  {0.003, 0.007}  according  to  the  functions  shown  in 
Figure  2;  /3(f)  £  {0.005,0.015}  and  7(f)  £  {0.003,0.007}  in  the  same  fashion.  In  the 
asynchronous  case,  we  let  f3v  (f )  is  T /4  behind  jv  (f )  because  cure  often  comes  after  attack 
is  identified.  To  draw  insights  into  whether  the  period  T  has  an  impact  on  the  outcome,  we 
consider  T  =  8,  16,  respectively.  In  any  case,  it  is  clear  that  /3(f)  is  an  implicit  function  of 
7(t)- 


PM  • 


7(t)  • 


P(t)  ■ 


7<t) 


P(t)  • 


(a)  synchronous  setting 


(b)  asynchronous  setting  (c)  anti-synchronous  setting 


Fig.  2.  Examples  of  synchronous,  asynchronous,  and  counter-synchronous  /3(f)  and  7(f) 

Figure  3  plots  the  curves  obtained  by  simulation  and  by  model  computing  in  the  case  of 
both  /3„(f)  and  jv(t)  have  period  T  =  8  and  T  =  16  (as  shown  in  Figure  2).  In  each  graph, 
we  let  the  virus  initially  infect  2,292  or  20%  vertices  that  are  randomly  selected;  note  that 
the  degree  of  initial  infection  does  not  impact  whether  the  virus  spreading  will  die  out 
or  not.  Since  the  model  computing  and  simulation  results  (obtained  as  the  average  of  50 
simulation  runs)  match  almost  perfectly  no  matter  the  virus  spreading  will  die  out  or  not, 
we  will  use  simulation  and  model  computing  interchangeably.  Since  the  same  phenomenon 
applies  to  both  cases  of  T  =  8  and  T  =  16,  in  what  follows  we  only  report  the  case  of 
T  =  8.  Again,  the  accuracy  result  allows  us  to  obtain  iv  (f )  via  model  computing  in  the 
process  of  confirming  the  analytical  results  of  our  adaptive  control  strategies. 

3.2  Confirmation  of  the  sufficient  conditions  in  the  semi-adaptive  scenario 

In  this  section,  we  use  the  aforementioned  Oregon  graph  to  confirm  our  analytical  results 
presented  in  Section  2.2.  We  confirm  Theorem  2  and  Corollary  1  because  they  offer  suc¬ 
cinct  sufficient  conditions  under  which  the  virus  spreading  dies  out. 

3.2. 1  Confirmation  of  Theorem  2.  For  the  graph,  we  let  the  virus  initially  infect  2,292 
or  20%  randomly  selected  nodes,  and  consider  three  cases  between  the  model’s  input  pa¬ 
rameter  —  synchrony,  asynchrony,  and  anti-synchrony  as  illustrated  in  Figure  2.  Since 
Theorem  2  has  two  parts,  we  confirm  them  respectively. 

Case  1:  Confirmation  of  the  sufficient  condition  under  which  the  virus  spreading 
will  die  out.  We  consider  two  sets  of  parameters:  (i)  /3(f)  £  {0.3,  0.5}  and  7 (f)  £ 
{0.003,0.007};  (ii) /3(f)  £  {0.1, 0.22}  and  7(t)  £  {0.001,0.003}.  Both  functions,  /3 (t) 
and  7 (f),  have  period  T  =  8.  Both  parameter  sets  satisfy  the  sufficient  condition  of  Theo¬ 
rem  2,  namely  Ai  <  /3/7,  which  means  that  the  virus  spreading  will  die  out.  Figure  4  plots 
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Fig.  3.  Model  accuracy 


2100 

1800 


. 

synchronous  case  — i — 
asynchronous  case  x-  _ 
anti-synchronous  case 

1  -  - 

A 

-v£ . — 

-  V 

-  \ 

time 


(a)  Parameter  set  (i) 


(b)  Parameter  set  (ii) 


Fig.  4.  Confirmation  of  sufficient  condition  under  which  virus  spreading  dies  out  (T  =  8) 


the  dynamics  of  the  numbers  of  infected  nodes  with  respect  to  time.  From  Figure  4  we  can 
draw  the  following  observations.  First,  the  virus  spreading  does  die  out  at  about  the  50th 
and  25th  step,  respectively,  which  confirms  the  sufficient  condition  under  which  the  virus 
spreading  will  die  out.  It  is  an  interesting  future  work  to  quantitatively  characterize  how 
the  speed  of  convergence  (i.e.,  dying  out)  depends  upon  functions  /3(f)  and  7(f).  Second, 
it  is  counter-intuitive  and  interesting  that  the  virus  spreading  is  somewhat  more  effectively 
defended  against  in  the  anti-synchronous  case  than  in  the  synchronous  case,  which  is  in 
turn  more  effectively  defended  against  than  in  the  asynchronous  case.  More  studies  are 
needed  in  order  to  explain  this  phenomenon.  Third,  the  curves  are  convex,  meaning  that 
cure  is  more  effective  in  the  early  stage  of  the  attack-defense  dynamics  than  in  the  later 
stage.  For  example,  it  takes  a  shorter  period  of  time  to  reduce  the  infection  from  2,292 
nodes  to  100  nodes  than  to  reduce  the  infection  from  100  nodes  to  zero  nodes  (i.e.,  dying 
out). 

Case  2:  Confirmation  of  the  condition  under  which  the  virus  spreading  may  not  die 
out.  We  consider  two  sets  of  parameters:  (i)  /3(f)  6  {0.2,  0.4}  and  7 (t)  £  {0.003, 0.007}; 
(ii)  /3(f)  £  {0.05,0.15}  and  7(f)  £  {0.001,0.003}.  Both  functions,  /3(f)  and  7(f),  have 
period  T  =  8.  Both  parameter  sets  do  not  satisfy  the  sufficient  condition  of  Theorem  2 
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because  Ai  >  fl/'y,  which  means  that  the  virus  spreading  does  not  die  out  in  some  initial 
infection  configurations. 
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Fig.  5.  Confirmation  of  virus  spreading  not  dying  out  (T  =  8) 


Figure  5  plots  the  dynamics,  from  which  we  draw  the  following  observations.  First, 
the  virus  spreading  does  not  die  out,  which  confirms  Theorem  2.  Second,  all  the  curves 
exhibit  periodic  behaviors,  but  the  extent  of  oscillation  in  the  case  of  anti-synchrony  is  more 
significant  than  in  the  case  of  asynchrony,  which  in  turn  is  more  significant  than  in  the  case 
of  synchrony.  This  means  that  the  degree  of  synchrony  between  /3(f )  and  7 (f)  will  impact 
the  outcome  when  the  virus  spreading  does  not  die  out.  Third,  comparing  Figures  5(a)  and 
5(b),  we  observe  that,  under  the  same  synchrony,  the  outcome  will  depend  on  functions 
/ 3(f )  and  7(f).  More  studies  are  needed  to  characterize  these  dependence  relationships. 

3.2.2  Confirmation  of  Corollary  1.  Corollary  1  gives  an  even  more  succinct  sufficient 
condition  under  which  the  virus  spreading  will  die  out.  For  the  graph,  we  let  the  virus 
initially  infect  2,291  or  20%  randomly  selected  nodes.  For  the  case  the  sufficient  condition 
in  Corollary  1  is  satisfied,  we  consider  two  sets  of  parameters  that  are  uniformly  chosen 
at  random  from  certain  intervals:  (i)  /3(f)  G  [0.7, 0.9]  and  7 (f)  G  [0.006,0.0014];  (ii) 
/3(f)  G  [0.1, 0.3]  and  7(f)  G  [0.0015,0.0035].  In  each  parameter  setting,  the  sufficient 
condition  stated  in  Corollary  1  is  satisfied,  namely  Ai  <  E(/3(0))/E(7(0)),  which  means 
that  the  virus  spreading  will  die  out. 

For  the  case  the  sufficient  condition  in  Corollary  1  is  not  satisfied,  we  consider  two 
sets  of  parameters  that  are  uniformly  chosen  at  random  from  certain  intervals:  (i)  /3(f)  G 
[0.4, 0.6]  and  7(f)  G  [0.006, 0.014];  (ii)  /3(f)  G  [0.05,0.15]  and  7(f)  G  [0.0015,0.0035]. 
In  each  parameter  setting,  the  sufficient  condition  stated  in  Corollary  1  is  not  satisfied 
because  Ai  >  E(/3(0))/E(7(0)),  and  thus  the  analytical  result  says  that  the  virus  spreading 
does  not  die  out  in  some  initial  infection  configurations. 

Figure  6(a)  plots  the  dynamics  of  the  number  of  infected  nodes  with  respect  to  time 
when  the  sufficient  condition  is  satisfied.  From  it  we  can  draw  the  following  observations. 
First,  the  virus  spreading  does  die  out,  as  predicted  by  Corollary  1.  Second,  the  larger  the 
/3(f),  the  more  effective  the  defense  against  the  virus  spreading.  Third,  all  the  curves  are 
convex,  meaning  that  it  takes  a  shorter  period  of  time  to  significantly  reduce  the  number 
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(b)  The  sufficient  condition  is  not  satisfied 


Fig.  6.  Confirmation  of  Corollary  1  (T  =  8) 


of  infected  nodes  (e.g.,  from  2,291  to  50)  than  to  making  the  virus  spreading  die  out  (e.g., 
from  50  to  zero). 

Figure  6(b)  plots  the  dynamics  of  the  number  of  infected  nodes  with  respect  to  time 
when  the  sufficient  condition  is  not  satisfied.  From  it  we  can  draw  the  following  obser¬ 
vations.  First,  the  virus  spreading  does  not  die  out,  which  confirms  Corollary  1.  Sec¬ 
ond,  the  larger  (in  a  stochastic  sense)  the  /3(f),  the  earlier  the  system  will  converge  to  the 
steady  state.  However,  the  ultimate  degree  of  infection  does  not  depend  on  /3(f),  but  rather 
on  E(/3(0))/E(7(0)),  which  means  that  E(/3(0))/E(7(0))  may  be  used  as  an  indicator  of 
steady-state  infection  when  the  virus  spreading  does  not  die  out.  It  is  an  interesting  future 
work  to  rigorously  characterize  this  phenomenon. 

3.3  Confirmation  of  the  controllability  in  the  fully-adaptive  scenario 

3.3.1  Confirmation  of  Theorem  3.  Theorem  3  states  that  even  if  we  do  not  know  7(f) 
but  we  may  be  able  to  observe  iv(t)  and  may  be  able  to  adjust  the  defense  as  needed, 
following  its  control  strategy  will  cause  the  dying  out  of  the  virus  spreading.  To  compare 
the  effects  of  adaptive  control  and  semi-adaptive  control,  in  our  simulation  study,  we  also 
used  the  periodical  functions  /3(f)  £  {0.375,  0.40}  and  7(f)  £  (0.003, 0.007}  with  period 
T  =  8  as  illustrated  in  Figure  2.  These  parameters  satisfy  the  sufficient  condition  in  Theo¬ 
rem  2,  which  means  that  the  virus  spreading  will  die  out  as  we  discussed  above.  To  ensure 
comparability,  we  also  let  the  virus  initially  infect  2,291  or  20%  randomly  selected  nodes. 
As  mentioned  before,  since  the  anti-synchronous  defense  is  somewhat  more  effective  than 
the  synchronous  and  asynchronous  defenses,  we  will  compare  it  with  the  outcome  of  the 
adaptive  control  strategy. 

Figure  7(a)  plots  the  dynamics  of  the  number  of  infected  nodes  with  respect  to  time 
in  the  following  four  cases:  the  adaptive  control  parameter  p  =  0.005;  the  adaptive 
control  parameter  p  =  0.01;  the  adaptive  control  parameter  p  =  0.02;  the  comparison 
dynamics  corresponding  to  the  anti-synchronous  case  with  T  =  8  periodical  function 
/3(f)  £  (0.375,  0.40}.  We  draw  the  following  observations.  First,  p  plays  a  crucial  role 
in  indicating  the  rate  at  which  the  virus  spreading  dies  out.  For  example,  for  p  =  0.02,  it 
takes  only  about  80  steps  to  reduce  the  number  of  infected  nodes  from  2,291  to  160  (nev¬ 
ertheless  it  takes  another  60  steps  to  kill  the  virus  spreading,  namely  to  reduce  the  number 
of  infected  nodes  from  160  to  zero);  for  p  =  0.01,  it  takes  about  130  steps  to  reduce  the 
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Fig.  7.  Confirmation  of  semi-adaptive  dying  out  vs.  fully-adaptive  control 


number  of  infected  nodes  from  2,291  to  160  (nevertheless  it  takes  about  another  100  steps 
to  kill  the  virus  spreading,  namely  to  reduce  the  number  of  infected  nodes  from  160  to 
zero).  This  also  confirms  the  physical  meanings  of  Proposition  1  discussed  above. 

Second,  Figure  7(a)  indicates  that  for  all  p  =  0.005,  p  =  0.01  and  p  =  0.02,  the 
fully-adaptive  defenses  are  less  effective  than  the  semi-adaptive  defense  represented  by 
/3(f)  G  {0.375,0.40}.  As  we  show  in  Figure  7(b),  this  is  caused  by  the  fact  that  the 
semi-adaptive  /3(f)  is  much  larger  than  the  adaptive  /3(f).  This  means  that  the  sufficient 
condition  in  the  semi-adaptive  case,  under  which  the  virus  spreading  dies  out,  may  be 
significantly  beyond  being  necessary.  In  contrast,  the  fully-adaptive  control  strategy  is 
much  more  “cost-effective”  because  larger  /3(f)  will  likely  cause  a  higher  cost. 

3.3.2  Confirmation  of  Theorem  4.  Theorem  4  states  that  even  if  we  do  not  know  7(f) 
but  we  may  be  able  to  observe  iv(t)  and  may  be  able  to  adjust  the  defense  (but  can¬ 
not  kill  the  virus  spreading),  then  following  its  control  strategy  will  cause  the  contain¬ 
ment  of  the  virus  spreading.  In  our  simulation  study,  we  used  the  periodical  function 
7 (f)  G  (0.0005, 0.001}  with  period  T  =  8  similar  to  what  was  shown  in  Figure  2.  This 
input  parameter  is  not  used  in  our  adaptive  control  algorithm,  rather  it  is  merely  for  the 
purpose  of  comparison  to  the  sufficient  condition  in  Theorem  2,  which  requires  the  (3V  (f ) 
satisfy  certain  property  (for  example,  we  use  /3(f)  G  {0.01, 0.02},  meaning  that  the  virus 
spreading  does  not  die  out  as  predicated),  and  the  /3„(f)  derived  from  our  adaptive  con¬ 
trol  strategy.  To  ensure  the  comparability,  we  also  let  the  virus  initially  infect  2,291  or 
20%  randomly  selected  nodes.  As  mentioned  before,  since  the  anti-synchronous  defense  is 
more  effective  than  the  synchronous  and  asynchronous  defenses,  we  will  compare  it  with 
the  outcome  of  the  adaptive  control  algorithm. 

In  our  simulation  we  set  i*  =  0.1  (i.e.,  we  want  to  contain  the  degree  of  infection  to 
10%)  and  uv  =  0  (i.e.,  a  special  case  of  Theorem  4).  Figure  8(a)  plots  the  dynamics 
of  the  number  of  infected  nodes  with  respect  to  time  in  the  following  three  cases:  the 
adaptive  control  parameter  p  =  0.001;  the  adaptive  control  parameter  p  =  0.001  with 
fixed  7(f)  =  (0.0005  +  0.001)/2  =  0.00075  as  specified  in  Theorem  4);  the  compari¬ 
son  dynamics  corresponding  to  the  anti-synchronous  case  with  T  =  8  periodical  function 
/3(f)  G  {0.01, 0.02}.  We  draw  the  following  observations.  First,  the  control  strategy  does 
contain  the  infection  to  the  pre-determined  level  of  i*  =  0.1  or  10%  infection.  Moreover, 
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Fig.  8.  Confirmation  of  the  vims  spreading  containment  via  fully-adaptive  control  (the 
red-color  curves  are  hidden  behind  the  green-color  curves) 

the  adaptive  control  strategy  is  robust  because  perturbation  in  7  does  not  fundamentally 
change  the  dynamics  behavior.  This  also  confirms  the  physical  meanings  of  Proposition  2 
discussed  above. 

Second,  Figure  8(a)  indicates  that  the  adaptive  defenses  are  slightly  less  effective  in 
defending  against  the  virus  spreading  than  the  defense  of  T  =  8  periodical  function  /3(f)  e 
{0.01, 0.02}.  As  we  show  in  Figure  8(b),  the  adaptive  control  strategy  can  be  much  more 
“cost-effective”  because  it  leads  to  significantly  smaller  /3(f). 

4.  CONCLUSION 

We  have  presented  a  novel  dynamical  systems  model  for  studying  both  semi-adaptive  and 
fully-adaptive  defenses  against  virus  spreading.  For  semi-adaptive  defense,  we  give  gen¬ 
eral  as  well  as  succinct  sufficient  conditions  under  which  the  virus  spreading  will  die  out. 
For  fully-adaptive  defense,  we  characterize  two  adaptive  control  strategies  under  which 
the  virus  spreading  will  die  out  or  will  be  contained  to  a  desired  level  of  infection.  Our 
analytical  results  are  confirmed  with  simulation  study. 

This  paper  brings  a  range  of  open  questions  for  future  research.  In  addition  to  those 
mentioned  in  the  body  of  the  paper,  here  are  more  examples:  What  are  the  necessary 
conditions  under  which  the  virus  spreading  will  die  out?  What  are  the  optimal  adaptive 
control  strategies? 
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Abstract — Understanding  the  spreading  dynamics  of  computer  viruses  (worms,  attacks)  is  an  important  research  problem,  and  has 
received  much  attention  from  the  communities  of  both  computer  security  and  statistical  physics.  However,  previous  studies  have 
mainly  focused  on  single-virus  spreading  dynamics.  In  this  paper,  we  study  multivirus  spreading  dynamics,  where  multiple  viruses 
attempt  to  infect  computers  while  possibly  combating  against  each  other  because,  for  example,  they  are  controlled  by  multiple 
botmasters.  Specifically,  we  propose  and  analyze  a  general  model  (and  its  two  special  cases)  of  multivirus  spreading  dynamics  in 
arbitrary  networks  (i.e.,  we  do  not  make  any  restriction  on  network  topologies),  where  the  viruses  may  or  may  not  coreside  on 
computers.  Our  model  offers  analytical  results  for  addressing  questions  such  as:  What  are  the  sufficient  conditions  (also  known  as 
epidemic  thresholds)  under  which  the  multiple  viruses  will  die  out?  What  if  some  viruses  can  “rob”  others?  What  characteristics  does 
the  multivirus  epidemic  dynamics  exhibit  when  the  viruses  are  (approximately)  equally  powerful?  The  analytical  results  make  a 
fundamental  connection  between  two  types  of  factors:  defense  capability  and  network  connectivity.  This  allows  us  to  draw  various 
insights  that  can  be  used  to  guide  security  defense. 

Index  Terms — Multiple  virus  dynamics,  epidemic  dynamics,  epidemic  threshold,  complex  networks,  complex  systems,  cyber  warfare 
model. 
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1  Introduction 

O  solve  the  problem  of  computer  viruses  (malware, 
worms,  or  bots),  we  need  a  set  of  approaches,  ranging 
from  legislation  to  technology.  One  technological  ap¬ 
proach,  we  call  the  dynamical  systems  approach,  aims  to 
understand  the  attack(-defense)  dynamics  so  as  to  offer 
insights  and  guidance  for  defense  from  a  whole-system 
perspective  (i.e.,  by  addressing  macroscopic  questions).  This 
approach  is  complementary  to  the  well  established 
approaches  such  as  access  control  and  cryptography. 
Despite  the  attention  that  has  been  paid  by  communities 
including  computer  security  and  statistical  physics,  exist¬ 
ing  studies  mainly  focused  on  single-virus  spreading 
dynamics.  In  this  paper,  we  study  multivirus  spreading 
dynamics,  where  multiple  viruses  may  be  controlled  by 
multiple  attackers  and  thus  may  combat  against  each 
other.  We  have  witnessed  incidents  of  this  type.  For 
example,  the  Welchia  worm  tried  to  "kill"  the  Blaster 
worm  by  deleting  the  Msblast.exe  file  created  by 
W3 2  .  Blaster  .Worm  [26].  Although  such  incidents  are 
not  exactly  as  full-fledged  and  sophisticated  as  the  virus 
combats  investigated  in  the  paper,  they  can  be  seen  as  a 
prelude  to  (some  of)  the  investigated  scenarios  that  may 
soon  become  reality.  As  such,  we  need  to  investigate  and 
understand  the  consequences  of  fighting  viruses  so  as  to 
get  prepared  by  enhancing  our  body  of  knowledge. 
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1.1  Our  Contributions 

For  multivirus  spreading  dynamics,  there  are  two  scenarios: 
the  viruses  spread  independent  of  each  other  and  thus  the 
dynamics  can  be  understood  as  a  trivial  extension  of  the 
single-virus  dynamics;  the  viruses  spread  nonindepen- 
dently  and  may  further  fight  against  each  other.  The 
present  paper  focuses  on  the  latter  scenario  and  its  two 
special  cases:  1)  the  viruses  can  coreside  on  computers,  and 
2)  the  viruses  do  not  coreside  on  computers  because  the 
viruses  can  "rob"  each  other.  Specifically,  we  make  the 
following  contributions. 

First,  we  propose  a  novel  continuous-time  nonlinear 
dynamical  system  model  for  the  spreading  of  multiple  viruses 
in  arbitrary  networks  (because  any  topology  would  be 
possible  in  practice).  The  general  model  is  difficult  to 
analyze,  but  we  manage  to  present  general  sufficient 
conditions  (also  known  as  epidemic  thresholds)  under  which 
all  viruses  will  die  out  (one  applies  regardless  of  the  degree 
of  initial  infection,  and  the  other  applies  when  the  initial 
infection  is  light).  We  also  present  simplified  versions  of 
these  sufficient  conditions  so  as  to  make  them  easier  to  use. 

Second,  in  order  to  gain  deeper  results,  we  further 
consider  the  aforementioned  two  special  cases  of  the 
general  model.  While  the  special  cases  naturally  inherit 
the  sufficient  conditions  in  the  general  model  (which 
become  more  succinct  in  the  special  cases  though),  we 
additionally  obtain  the  following  results.  In  the  special 
case,  where  multiple  viruses  do  not  coreside  on  compu¬ 
ters,  we  investigate  the  combat  between  the  viruses, 
namely  that  some  viruses  can  "rob"  other  viruses  (e.g., 
botmasters  rob  each  other's  bots).  We  analytically  answer: 

1)  Under  what  conditions  the  weak  viruses  (i.e.,  viruses 
that  can  be  robbed  by,  but  cannot  rob,  others)  will  die  out? 

2)  What  are  the  outcomes  when  the  viruses  are  (approxi¬ 
mately)  equally  powerful?  In  the  special  case  where 
multiple  viruses  may  coreside  on  computers,  we  analyti¬ 
cally  answer:  1)  Under  what  conditions  some  viruses  will 
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die  out?  2)  What  are  the  outcomes  when  the  viruses  are 
(approximately)  equally  powerful  in  the  setting  of  two 
viruses  (the  setting  of  three  or  more  viruses  is  left  as  an 
open  problem)?  Finally,  we  conduct  a  simulation  study, 
under  the  same  model  assumptions,  to  confirm  the 
analytical  results  offered  by  the  model. 

Third,  the  analytical  results  make  a  fundamental  con¬ 
nection  between  two  types  of  factors:  the  defense  capability 
and  the  connectivity  of  the  network  (via  the  largest 
eigenvalue  of  its  adjacency  matrix).  Therefore,  the  analytical 
results  not  only  deepen  our  understanding  of  multivirus 
spreading  dynamics,  but  also  offer  insights  and  guidance 
for  defense  (e.g.,  how  to  tune  the  parameters  by  adjusting 
the  defense  mechanisms  and  /  or  how  to  tune  a  network  so 
as  to  kill  all  or  some  of  the  viruses).  The  analytical  results 
are  often  “succinct"  (i.e.,  simple  expressions  of  the  para¬ 
meters)  enough  to  allow  us  to  draw  further  insights  by 
comparing  them. 

1.2  Related  Work 

There  are  mainly  two  approaches  to  understanding  security 
properties  from  a  whole-system  perspective:  epidemic-like 
and  attack-graph-like.  This  paper  falls  into  the  epidemic¬ 
like  approach. 

Epidemic-like  approach.  This  approach  aims  to  char¬ 
acterize,  among  other  things,  sufficient  conditions  (epi¬ 
demic  thresholds)  under  which  the  spreading  will  die  out. 
The  study  of  computer  virus  dynamics  was  started  by 
Kephart  and  White  [19],  [20],  who  adopted /adapted  some 
homogeneous  biological  epidemic  models  [22],  [21],  [3],  [2], 
[15].  Recent  studies  investigated  heterogeneous  models  (see, 
e.g.,  [35],  [4],  [25],  [30],  [31],  [29],  [10]),  which  are  interesting 
not  only  for  their  own  sake  but  also  because  of  their 
practical  potential  (because  any  topology  is  possible).  These 
studies  were  mainly  conducted  by  the  statistical  physics 
community  with  an  emphasis  on  power  law  networks  (see, 
e.g.,  [25],  [30],  [31],  [29]).  Very  recently,  computer  scientists 
investigated  spreading  in  arbitrary  networks  [34],  [13],  [10] 
(rather  than  just  power  law  topologies).  This  paper  moves  a 
significant  step  beyond  them  by  considering  multiple  viruses 
spreading  in  arbitrary  networks,  where  the  viruses  may 
combat  against  each  other  as  well. 

Attack-graph-like  approach.  In  an  attack  graph,  a  node  is 
the  state  of  a  network  and  an  arc  depicts  a  step  in  an  attack 
that  exploits  vulnerabilities  uncovered  by  vulnerability 
scanners  (cf.  [33],  [18],  [1]  and  their  follow-ons).  This 
approach  aims  to  identify,  among  other  things,  1)  attack 
paths  between  a  starting  node  or  a  set  of  starting  nodes  and 
some  target  node,  and  2)  guidance  for  security  hardening 
[27],  [24].  Recently,  Bursztein  and  Goubault-Larrecq  [8],  [7], 
[6],  [9]  introduced  the  novel  concept  of  anticipation  game, 
which  aims  to  introduce  game  theory  into  attack  graphs  so 
as  to  prove  that  a  safety  property  holds  and  further  to 
answer  questions  about  strategic  objectives  (e.g.,  what  is  the 
most  effective  patching  strategy?). 

On  the  relationship  between  the  two  approaches.  At  the 
current  stage  of  our  knowledge,  the  two  approaches  are 
incompatible.  Here,  we  discuss  some  informal  observations, 
while  pointing  out  that  it  is  an  important  future  work  to 
precisely  characterize  the  relationship  (e.g.,  how  can  we 
unify  them  into  a  more  general  framework?).  First,  the  two 


approaches  share  some  common  goals.  1)  Both  approaches 
take  a  whole-system  perspective,  rather  than  a  component 
perspective.  For  example,  the  vulnerabilities  used  in  the 
attack-graph-like  approach  can  be  used  as  an  input  for 
constructing  the  system  graph  used  in  the  epidemic-like 
approach.  2)  Both  approaches  are  useful  for  “what  if" 
analysis  as  we  show  in  this  paper.  Second,  the  two 
approaches  can  answer  different  kinds  of  questions  under 
different  premises.  1)  Attack-graph-like  approach  often 
assumes  known  vulnerabilities  that  can  be  obtained  using 
scanners,  and  thus  cannot  deal  with  zero-day  exploits. 
Epidemic-like  approach  in  principle  can  accommodate  zero- 
day  exploits,  perhaps  with  the  help  of  domain  expert 
knowledge.  2)  Attack-graph-like  approach  suffers  from  the 
state  explosion  problem  [1],  [11];  whereas,  epidemic-like 
models  can  handle  very  large  networks  (e.g.,  hundreds  of 
millions  of  nodes).  3)  Epidemic-like  approach  can  make  a 
fundamental  connection  between  the  defense  capability 
and  the  network  connectivity.  We  do  not  know  how  to 
accommodate  network  connectivity  into  the  attack-graph- 
like  approach. 

Outline.  In  Section  2,  we  present  and  explore  a  general 
model  of  multivirus  dynamics.  We  investigate  in  Section  3 
the  special  case  that  the  viruses  may  fight  against  each  other 
and  do  not  coreside  on  computers,  and  in  Section  4  the 
special  case  that  the  viruses  may  coreside  on  computers.  In 
Section  5,  we  draw  further  insights  from  the  analytical 
results.  In  Section  6,  we  report  our  simulation  study.  We 
conclude  the  paper  in  Section  7  with  open  problems. 

2  General  Multivirus  Dynamics  Model 
2.1  The  Master  Equation 

Parameters.  Let  J  denote  a  set  of  viruses  with  \J\  >2. 
Suppose  viruses  spread  according  to  a  finite  network 
G  =  (V,  E ),  where  V  =  (1, 2, . . . ,  n}  is  the  set  of  nodes  or 
vertices  and  E  is  the  set  of  edges  or  arcs.  We  call  G  the  system 
graph,  and  denote  by  A  =  [avu]  its  adjacency  matrix,  where 
avv  =  0  and  for  all  v  ^  u,  avu  =  1  if  and  only  if  (u,  v)  G  E.  This 
representation  accommodates,  and  thus  our  results  equally 
apply  to,  both  directed  and  undirected  graphs. 

A  system  graph  abstracts  a  networked  system.  A  vertex 
in  a  system  graph  my  represent  a  computer  with  some 
vulnerabilities,  and  an  edge  or  arc  captures  that  the 
exploitation  of  one  vulnerability  or  the  compromise  of  one 
computer  could  lead  to  the  exploitation/compromise  of 
another.  A  system  graph  may  be  obtained,  for  example,  by 
combining  the  output  of  vulnerability  scanners  and  systems 
configurations  (in  a  fashion  similar  to  [12],  [28],  [33],  [18], 
[1],  [11]),  by  analyzing  the  firewall  isolation/filtering  rules 
(e.g.,  which  computer  is  allowed  to  directly  access  which 
other  computers),  or  by  using  the  concept  of  dependency 
graphs  in  a  fashion  similar  to  [6].  It  is  worthwhile  to  point 
out  that  our  goal  is  to  investigate  the  dynamics  without 
making  any  assumption  about  the  system  graph,  which 
certainly  exists  and  can  be  arbitrary  in  practice. 

The  following  table  summarizes  the  main  parameters 
and  notations  used  throughout  the  paper. 
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A  =  |«h] 

adjacency  matrix  of  system  graph  G  =  ( V ,  E)  where  |  V\  =  n,  avu  =  1  if 
and  only  if  (u,  v )  €  E,  and  avv  =  0  for  all  v  e  V. 

d 

the  maximum  node  (in-)degree  in  G  =  (V.  E ),  d  =  maxv€y{y^ueV,  avu} 

Ai, . . . ,  An 

eigenvalues  of  adjacency  matrix  A  with  Ai  being  the  largest  (in  modulus) 

X\, ...  ,Xu 

the  geometrically  distinct  eigenvalues  of  adjacency  matrix  A 

j 

the  set  of  multiple  viruses 

Ke(z) 

the  real  part  of  a  complex  number  2 

X  (M) 

the  set  of  eigenvalues  of  a  square  matrix  M 

diag[xi,  •  •  •  ,  rc„] 

the  diagonal  matrix  of  the  given  diagonal  elements 

HI 

the  norm  of  vector  or  matrix  with  ||  •  ||2  denoting  the  2-norm 

R 

the  set  of  real  numbers 

The  following  table  summarizes  the  main  notations  used 
in  the  general  model  (Section  2)  and  their  simplified  versions 
(if  applicable)  used  in  its  special  cases  (Sections  3  and  4).  The 
simplification  is  for  the  purpose  of  more  succinct  representa¬ 
tions  of  the  results  and  easier  recognition  of  the  models  in 
which  the  results  are  obtained. 


j,  k,  1,  m 

subsets  of  J  with  m  =  0  (I ,  J,  L  C  J  are  used  in  Section  IV) 

the  probability/capability  that  a  k-infected  node  u,  (u,v)  e  E,  causes  a  j-infected 

node  v  to  become  1-infected;  in  Section  III,  is  simplified  as  7^  and 

is  simplified  as  dij;  in  Section  IV,  is  simplified  as  1?/^  where  /  C  J 

the  probability  at  time  t  that  a  j-infected  node  v  becomes  1-infected  because  of 

its  k-infected  neighbors;  in  Section  HI,  <^{j>  *s  simplified  as  Svj  and 
is  simplified  as  6v,t,j ;  in  Section  IV,  's  simplified  as  Svjtk 

the  probability/capability  that  a  j-infected  node  becomes  1-infected  because  of 

the  curing  of  viruses  belonging  to  k  =  j  \  1;  in  Section  III,  0^y  is  simplified 
as  /3 j;  in  Section  IV,  is  simplified  as  f3jtj 

relation  I 

(j,k,  1)  G  I  represents  the  state  transition  corresponding  to  <5*  j  k 

relation  R 

(j,  k.  1)  G  R  represents  the  state  transition  corresponding  to  /3]k 

j  (f ) 

the  probability  node  v  G  V  is  j-infected  at  time  t,  where  j  G  {1, . . . .  m};  in 
Section  in,  is  simplified  as  ivj 

The  state  transition  diagram.  We  consider  a  contin¬ 
uous-time  model.  At  any  time  t  >  0,  a  node  v  G  V  is  in  one 
of  the  following  m  states:  1, . . . ,  m  —  l,m  =  0,  where  the 
first  m  —  1  sets  are  nonempty,  distinct  subsets  of  J ,  and  m 
may  or  may  not  be  equal  to  2lJL  A  node  v  G  V  is 
susceptible  or  m-infected  if  it  is  not  infected  (i.e.,  secure 
but  vulnerable),  and  j-infected  if  it  is  infected  by  viruses 
belonging  to  j,  where  j  G  (1, . . . ,  m  —  1}.  We  use  "infected" 
and  "infectious"  interchangeably.  Each  node  v  G  V  changes 
its  state  according  to  a  directed  state  transition  diagram, 
which  should  not  be  confused  with  the  system  graph 
G=  (V,E).  (Fig.  1  is  the  general  state  transition  diagram 
for  the  case  of  \J\  =  2.) 

The  vertices  of  a  state  transition  diagram  are  the 
aforementioned  states  To  simplify  presentation, 

we  define: 

•  Infection-map  I:  (j,k,  1)  G  I  is  the  transition  of  node 
v's  state  from  j  to  1  because  of  its  neighboring  nodes 
it  in  state  k,  where  ( u,v )  G  E.  The  transition 
probability  at  time  t.  is 

^oj,kW  =  1  —  J^[  (1  —  ^j.kVkW):  (2-1) 

(u,v)€E 

where  d!  k  is  the  probability/ capability  of  a  single  k- 
infected  neighbor  u  causing  that  v's  state  changes 
from  j  to  1.  Note  that  possibly  d]  k  =  0  (e.g.,  when 
k  =  m)  and  thus  5*  jk(f)  =  0-  Note  also  that  in  the 
derivation  of  (2.1),  we  assumed  that  the  infection 
events  and  the  curing  events  are  independent. 


Fig.  1 .  State  transition  diagram  of  v  e  V  with  \J\  =  2.  We  can  define 
m  =  {},  1  =  {1},  2  =  {2},  and  3  =  {1,2}.  For  the  sake  of  generality,  we 
use  the  notation  [];  for  example,  ^{i} {p p}  's  short  for  {2j  or 
2j meaning  that  the  transition  from  state  {1}  to  state  {1,2}  may 
be  caused  by  the  neighboring  nodes  that  were  infected  by  virus  2  or  by 
viruses  1  and  2.  By  setting  certain  transition  probabilities  to  zero,  we  get 
special  models  that  will  be  investigated  for  deeper  results. 

•  Recovery-map  R:  (j,k,  1)  G  R  is  the  transition  of 
node  v's  state  from  j  to  1  because  of  the  curing  of 
viruses  belonging  to  k  C  j,  where  j  \  k  =  1.  The 
transition  probability  is  /3j,  k. 

Define  Pj,jk(i),  the  probability  that  a  node  v's  state 
transforms  from  j  to  1  at  time  t  as 

pi  (f\  =  f  @ j,k’  (t  k’  1)  g  R- 

”J'kU  |  <,*(*)’  (j’MeL 

Because  an  infected  node  may  become  susceptible  again, 
our  model  is  a  susceptible-infectious-susceptible  (SIS)  one. 

The  master  equation.  We  are  concerned  with  ivj (t),  the 
probability  v  G  V  is  j-infected  at  time  t  where  j  G  (1, . . . ,  m}. 
The  master  equation  of  the  nonlinear  dynamical  system  is 

=  ^2  Pi),l,kVlW 

(l,k):(l,kj)eR(Jl  ^2  2) 

—  ^uj.k'VjWi 

(k',l'):(j,k'!')el(jR 

for  dgF  and  j  G  (1, . . . ,  m}.  We  notice  that 

iv j  (£)  =  1  for  t  >  0. 

je{l,...,m} 

2.2  Preliminaries 

We  will  need  the  following  lemmas,  whose  notations  may 
be  treated  as  independent  of  the  main  body  of  the  paper. 
Lemmas  1  and  2  are  available  from  standard  textbooks,  but 
may  be  less  known  when  compared  with  Gershgorin's  disc 
theorem  [16],  the  Perron-Frobenius  theorem  [5],  and 
Gronwall's  inequality  [14],  which  will  be  used  in  our 
analysis  as  well. 

Lemma  1  (Theorem  2.5.3,  pp.  114,  [17]).  Let  C  =  (c.yj  G 

be  a  nonsingular  matrix  with  Cy  <  0,  i,j=l,...,n,  and 
i  j.  Then  the  following  statements  are  equivalent. 


XU  ET  AL.:  A  STOCHASTIC  MODEL  OF  MULTIVIRUS  DYNAMICS 


33 


•  C  is  an  M-matrix,  meaning  that  all  elements  of  C~l 
are  nonnegative. 

•  The  real  part  of  every  eigenvalue  of  C  is  positive. 

•  There  is  a  positive  definite  diagonal  matrix  P  such  that 
POP -1  is  strictly  column  (or  row)  diagonally 
dominant. 

•  There  is  a  positive  definite  diagonal  matrix  Q  such  that 
QC  +  CTQ  is  positive  definite. 

Lemma  2  (Theorem  4.2,  pp.  100,  [14]).  System  ftx(t)  =  Bx(t), 
with  x(t)  £  IR11  and  B  £  IR",n,  is  asymptotically  exponen¬ 
tially  stable  if  and  only  if  the  real  part  of  every  eigenvalue  of  B 
is  negative. 

Lemma  3.  Consider  a  vector  variable  u(t)  =  [ui, . . .  ,un(t)]T  £ 
IR”  satisfying 

d  n 

<  y ^CijXj(t),Vi  =  1, . . .  ,n,t  >  0, 

3= 1 

where  Cij  >  0  for  all  i  f  j.  Furthermore,  consider  the  following 
comparison  system: 

d  n 

^  Viif)  y  '  ^ijUjif)  ?  1, .  .  .  ,  Tl,  t  ^  0. 

3= 1 

If  Xi(  0)  <  yf  0)  for  all  i  =  1, ...  ,n,  then  xft)  <  yft)  holds 
for  all  i=  1, . . . ,  n  and  t  >  0. 

Proof.  Let  ig  be  the  index  of  the  component  which  satisfies 
Xi0(t)  =  Vi0(t)  for  the  first  time,  denoted  by  to-  We  have 


vciax.{'R.e(z)  :  z  £  X(w&  +  B  —  B  ),  w  £  A(A)}  <  0.  (2.4) 

The  Perron-Forbenius  theorem  [5]  says  that  Ai  is  a  real 
number.  Using  the  first  and  second  items  in  Lemma  1, 
(2.3)  implies  that  —  Ai©  —  B  +  B'  is  an  M-matrix  and  that 
there  is  a  positive  definite  diagonal  matrix  D  = 
diag[di, . . . ,  dm]  such  that  D[Ai©  +  B  —  B']  D_1  is  strictly 
diagonal  dominant  thanks  to  the  third  item  of  Lemma  1. 
Therefore,  D[w&  +  B  —  B']  IP1  is  strictly  diagonal  domi¬ 
nant  for  every  w  £  X(A).  This  leads  to  (2.4)  according  to 
Lemma  1  again. 

Second,  from  (2.1)  and  (2.2),  we  have  for  v  £  V  and 
j  £  {l,...,m-  1}: 

-vUj(f)  =  £  ^j,l,kUl(^)  +  flk,ltv,k(t) 

(l,k):(l,kj)el  (k,l):(k,lj)eR 

X  ^j.k'ViW-  XI  Pik’ivjit) 

(k',l'):(j,k',l')el  (k',l'):(j,k',l')sR 

—  y  ( y  a™'|^j.k^M,k(f)  j 

(i,k):(i,kj)ei  \nev  ) 

+  yy  /5i,iVk(f) _  yy  w 

(k,l):(k,lj)eR  (k',l'):(j.k',l')eR 

—  y  '  ^uuRkjU.k(f)  T  y  '  bkjiu]s(t)  bji(;j (t) . 

ueV'.k  k 

Now,  we  define  the  following  comparison  system: 

j,  X'cj  (t)  y  ]  kjXu.k(l)  4“  y^  hkj*Tv,k(f)  bj  Xyj  (j)  ■ 

at  U€V,  k  k 


d[xj0(t)  yip(t)\ 

dt 


—  cioio[xio(to) 

t=to 


Vi  o (*o)] 


+  X  cioj[xj(tQ)  -  yj{to)\  =  X^oiK'^o)  -  Vj{to)]  <  0, 
J#o  j¥=i  o 


owing  to  Xj(to)  <  yj(to)  for  all  j  f  ig.  This  implies  that 
xi„  ( t )  —  i/l0  (£)  does  not  strictly  increases  at  t  =  to,  which 
proves  the  lemma.  □ 

2.3  General  Sufficient  Conditions  for  the  Dying  Out 
of  All  Viruses 

Theorem  4  (A  general  sufficient  condition  for  all  viruses 
to  die  out  regardless  of  the  degree  of  initial  infection). 

Let  Ai  be  the  largest  (in  modulus)  eigenvalue  of  the  adjacency 
matrix  A  of  the  system  graph  G  =  (V,E).  Let  ©  =  [ay], 
where  k,  j  £  {1, . . . ,  m  —  1}  and 


ay  =  yy  :  (l,  k,j)  £  I$j ^(possibly  some  ay's  equal  to  zero). 

l 

Let  B  =  [by] ,  where  k.j  £  {1, ....  m  —  1} 

and  by  =  possibly  some  byS  equal  to  zero).  Let  B  = 
diag[ bi,..,Xm_i],  where  b'  =  E(k,i):(j,k,i)6R/?].k  f°r 

{1, . . . ,  m  —  1}.  If 

max{72e(^)  :  £  £  A(Ai©  +  B  —  B  )}  <  0,  (2.3) 

all  viruses  will  die  out  no  matter  how  many  nodes  are  initially 
infected  with  whichever  viruses. 

Proof.  First,  we  show  that  it  can  be  derived  from 
condition  (2.3)  that 


with  xvj (0)  =  h,j(0)  for  v  £  V  and  j  £  {1 _ , m  —  1}. 

Note  that  ivj(t)  <  xvj (t)  for  all  t  >  0.  By  letting  X ft)  = 
[x«j(*)]t)evje{i,...,m}'  we  §et  the  following  compact  form: 

jX{t)  =  AX(t)&  +  X(t) B  -  X(t)B'.  (2.5) 

Let  Xi,  X2,  ■  ■  ■ ,  Xff  be  the  geometrically  distinct  eigenva¬ 
lues  of  A.  Let  A  =  S'-1  JS  be  the  adjacency  matrix  A's 
Jordan  decomposition,  where  J  is  its  Jordan  canonical 
form  and  in  block-diagonal  form 


r  -h 


j  = 


Ji 


,  where  Jh 


Jh  J 


Aft  1  0 

0  Aft  1 
0  0  Aft 


0 

0 

0 


for  1  <  h  <  H. 


0  0  0  •  •  •  Aft 


Let  Y(t)  =  SX(t).  Then  (2.5)  can  be  rewritten  as 


j Y(t)  =  JY(t)&  +  Y(t) B  -  Y(t)B'.  (2.6) 

Write  Y(t)  in  row  as  Y(t)  =  [  Yj(f),  Y2(t), . . . ,  Yn(t)  ]T, 

where  n  =  |U|  and  each  row  Yv(t)  £  IRm_1.  Thus,  for 
each  Yv(t),  (2.6)  can  be  component-wisely  rewritten  as 
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jYv{t)  =  Yv(t)(Xi&  +  B  -  B')  +  Ut),v  G  V, 

where  £„(<)  =  0  if  Yv(t)  is  an  eigenvector  of  A  and  f„(t)  = 
Yu+i(<)0  otherwise.  Lemma  2  and  (2.4)  imply 
lim^oo  Y(t)  =  0  exponentially  if  £„(<)  =  0  for  all  t  >  0. 
By  induction,  we  can  prove  lim^oo  Y(t)  =  0  for  all  v.  This 
is  equivalent  to  lim^,*,  X(t)  =  0  exponentially.  From 
Lemma  3,  we  have  ivj(t)  <  Xvj (t)  for  all  v,  j,  and  t.  So, 
we  have  lim^,*,  iv j  (t)  =  0  exponentially  and  complete 
the  proof.  □ 

Now,  we  present  a  corollary  that  is  easier  to  use. 

Corollary  5  (a  simplified  sufficient  condition  for  all 
viruses  to  die  out  regardless  of  the  degree  of  initial 
infection).  If  for  all  j  G  {1, . . . ,  m  —  1},  it  holds  that 


Ai  < 


E(k,l):(j,k,l)eR  /^j.k  E(l,k):(l,k,j)eR  P\,k 


E 


(l,k):(l,kj)el  *d,k 


(2.7) 


all  viruses  zvill  die  out  no  matter  how  many  nodes  are  initially 
infected  with  whichever  viruses. 

Proof.  From  condition  (2.7),  we  have  for  all  je{l,..., 
m  —  1} 


-  bj  +  ^[Aiajk  +  bjk]  <  0.  (2.8) 

k 

Let  z  be  an  arbitrary  eigenvalue  of  matrix  A,  (-)  —  B  —  B'. 
The  Gershgorin  disc  theorem  [16]  says  |2+b'j  — 
Aiajj  -  bjj|  <  Ekyj[Aiaik  +  bjk]-  Thus,  TZe(z)  <  -b]  + 
EkIAiajk  +  bjk]/  where  the  right-hand  side  is  less  than 
zero  because  of  condition  (2.8),  which  implies  7 Ze(z)  <  0 
for  all  a  G  A(Ai©  +  B  —  B').  Via  Theorem  4,  we  complete 
the  proof.  □ 

Theorem  4  and  Corollary  5  say  that  when  the  para¬ 
meters  satisfy  the  respective  conditions,  all  viruses  will  be 
killed  no  matter  how  many  nodes  are  initially  infected. 
Therefore,  the  conditions  may  be  overdemanding  when 
only  some  (but  not  many)  nodes  are  initially  infected  (i.e., 
light  initial  infection).  In  what  follows  we  give  a  sufficient 
condition  under  which  the  viruses  die  out  in  the  case  of 
light  initial  infection;  this  sufficient  condition  is  close  to 
being  necessary. 

Theorem  6  (a  sufficient  condition  for  all  viruses  to  die  out 
in  the  case  of  light  initial  infection).  Let  ©  =  [akj]  G 
ipm— i,ro— l,  wjlere  k  j  e  J  m  —  1}  and  akj  =  dJmk.  If 


Jj.xvj(f)  —  ]  awi'^m,k‘2'«,k (t)  +  y  ]  /^k.l xv,k(t) 

aZ  uEV,  k  (k,l):(k.lj)eR 

_  yy 

(k,l):(j,k,l)6R 

y  ]  ^ziuhkjT1u,k(f)  +  y  ]  bkj .X',,  k  ( i )  bj T'v.j  ( t  )  • 

u€V.k  k 

(2.11) 


Let  X{t)  =  [^j(f)]„6vj6{1,...,m}-  Then,  (2.11)  can  be  re¬ 
written  as 


jfX (t)  =  AX{t)0  +  X(t) B  -  X(t)B'.  (2.12) 

Reasoning  in  a  fashion  similar  to  the  proof  of  Theorem  4, 
we  can  prove  (2.9)  is  sufficient  for  the  local  stability  of 
system  (2.2). 

Condition  (2.10)  says  that  the  real  part  of  some 
eigenvalue  of  linear  system  (2.11)  is  greater  than  zero. 
Noticing  that  (2.11)  is  the  linearization  of  (2.2)  and 
according  to  the  stable  manifold  theorem  [32],  if  the 
linearized  system  has  at  least  one  eigenvalue  with  a 
positive  real  part,  then  the  original  system  is  unstable. 
Thus,  we  conclude  that  the  original  system  (2.2)  is  unstable 
surrounding  ivj  =  0  for  all  j  C  J.  We  complete  the  proof.  □ 
In  a  similar  fashion,  we  have  the  following  corollary: 

Corollary  7  (a  simplified  sufficient  condition  for  the 
viruses  to  die  out  when  the  initial  infection  is  light).  If 

for  all  j  G  {1, . . . ,  m  —  1}, 


Ai  < 


E(k,l):(j,k,l)eR  (^j.k  E(l,k):(l,kj)eR  ^Lk 


E 


(k):(m,kj)el  ^!n,k 


(2.13) 


then  all  viruses  will  die  out  in  the  presence  of  light  initial 
infection. 


The  above  analytical  results  allow  us  to  draw: 

Insight  8.  The  above  analytical  results,  especially  Corollaries  5 
and  7,  can  quantitatively  guide  the  deployment /adjustment  of 
defense  mechanisms  (e.g.,  decreasing  d]  k  by  enforcing  deeper 
packet  inspection  over  network  edges,  and/or  increasing  /3j  k  by 
deploying  more  antivirus  softwares  of  different  vendors,  and/or 
decreasing  Ai  by  disabling/filtering  some  network  edges)  so  as 
to  satisfy  the  sufficient  conditions  and  to  kill  the  spreading  of 
all  viruses. 


3  Deeper  Results  in  the  Case  of  Disjoint 
Spreading 


ma,x{TZe(z)  :  z  G  A(Ai©  +  B  —  B  )}  <  0,  (2.9) 

then  all  viruses  zvill  die  out;  if 

max{7 Ze(z)  :  2  G  A(Ai©  +  B  —  B  )}  <  0,  (2-10) 

then  the  viruses  do  not  die  out. 

Proof.  For  v  G  V  and  j  G  {1, . . . ,  m  —  1},  consider  the 
following  linear  system  near  ivnl  =  1  or  equivalently 

iv j  =  0  for  all  v  G  N  and  j  1 _ _  rn  -  1,  by  neglecting 

all  terms  with  order  higher  than  1 


In  this  section,  we  consider  the  special  case  that  the  fighting 
viruses  do  not  coreside  on  computers,  which  allows  us  to 
simplify  some  notations.  Specifically,  we  say  a  node  v  G  V  is 
susceptible  if  v  is  not  infected,  and  j-infected  if  v  is  infected 
by  virus  j  where  j  G  J;  these  states,  respectively,  corre¬ 
spond  to  0-infected  and  {j}-infected  in  the  general  model. 
We  are  still  concerned  with:  sv(t),  the  probability  node  v  G 
V  is  susceptible  at  time  t,  which  corresponds  to  iv${t)  in  the 
general  model;  ivj(t ),  the  probability  v  G  V  is  j-infected  at 
time  t.  where  j  G  J ,  which  corresponds  to  Lwft)  in  the 
general  model.  The  invariant  is  sv(t)  +  E/_i  EjE)  =  T 
Other  parameters  are: 
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Afcmt(j)  =  {/  '■  there  is  an  arc  from  {j}  to  {£} 
in  the  state  transition  diagram}. 

In  other  words,  £  E  jV}n(j)  means  virus  j  can  rob  virus  £, 
and  l  E  M out  ( j)  means  virus  £  can  rob  virus  j.  For  example, 
corresponding  to  the  state  transition  diagram  in  Fig.  2,  we 
have  Uin(l)  =  Uouti  1)  =  {2}  and  J\f in(2)  =  M mt(2)  =  {1} 
due  to  the  symmetry  that  the  viruses  can  rob  each  other. 
Fig.  2.  A  simplified  state  transition  diagram  of  node  v  s  V  with  \  J\  =  2.  Then,  the  general  master  equation  (2.2)  becomes  for  v  (zV 


•  / 3f  The  probability/capability  a  j-infected  node 
becomes  susceptible  at  any  time,  where  j  E  J .  Note 
that  pj  corresponds  to  /J^.j  ^  in  the  general  model. 
We  may  call  the  diagonal  matrix  B  =  diag[/?i , . . . ,  /3|j|] 
the  "cure  matrix." 

•  t f  The  probability /capability  a  j-infected  node  u 
successfully  infects  a  susceptible  node  v,  where 
(u,  v)  E  E  and  j  G  J .  Note  that  7 y  corresponds  to 

in  the  general  model.  We  may  call  the  diagonal 
matrix  T  =  diag[7i, . . . ,  y\j\\  the  "infection  matrix." 

•  dey.  The  probability/capability  that  virus  j  success¬ 

fully  "robs"  virus  £  (i.e.,  a  node  v's  state  changes  from 
/-infected  to  j-infected),  where  (u,  v)  G  E,  £,j  e  J , 
and  £  /  j  (i.e.,  a  virus  does  not  rob  itself).  Note  that 
dej  corresponds  to  dj}j  ^  in  the  general  model.  We 
may  call  R  =  the  "robbing  matrix." 

•  Svj(t):  The  probability  a  susceptible  node  v  becomes 
j-infected  at  time  t,  with 

=  1  —  (1  —  7 flu, ](£))■ 

(u,v)€E 

Note  that  6vj(t)  corresponds  to  in  the 

general  model. 

•  6v,ej(t):  The  probability  an  /-infected  v  becomes 
j-infected  at  time  t,  with 

9v/,j{t)  =  1  —  (1  — 

(u,v)€E 

Note  that  9v/j(t)  corresponds  to  6v%Uj}W  in  the 


jtsv(t)  =  (  -  Jsv(i)  +  ( 


j,  —  8v,j(t)sv(t)  +  (  9v,e,j(t)  ]  iv,e(t)  (3.1) 

\eeMinU)  / 

—  |  j  j  3  ^  J- 

.  V  teNwtij)  J 


3.1  More  Succinct  Sufficient  Conditions  for  the 
Dying  Out  of  All  Viruses 

Consider  the  special  case  where  1  =  {/},  k  =  {k},  j  =  {j} 
in  the  general  model.  We  have  /3™m  =  /3j,  and  jf]  k  =  0  for 
all  other  cases;  we  have  dW  =  7^,  iTjj  =  for  1  /  m  and 
1  /  j,  and  i9j  k  =  0  otherwise.  The  following  are  special 
cases  of  Corollaries  5  and  7,  respectively. 

Corollary  9.  (more  succinct  sufficient  condition  for  all  viruses  to 
die  out  regardless  of  the  degree  of  initial  infection)  If  for  all 


j  G  J” ,  Ai  < 


then  all  viruses  will  die  out  no  matter  how  many  nodes  are 
initially  infected. 

Corollary  10  (more  succinct  sufficient  condition  for  all 
viruses  to  die  out  in  the  case  of  light  initial  infection). 

Suppose  some,  but  not  many,  nodes  are  initially  infected.  If  Xi  < 
zf  holds  Vj  =  1, . . . ,  \  J\,  then  the  viruses  will  die  out.  If3j  E 
(1, . . . ,  \J\}  such  that  Ai  >  then  the  viruses  do  not  die  out. 
Insight  8  is  naturally  inherited  by  the  above  results. 


general  model. 

The  state  transition  diagram  depicted  in  Fig.  1  is  now 
simplified  as  Fig.  2. 

For  a  given  state  transition  diagram,  we  define 

JP in  (j)  =  {£  ■  there  is  an  arc  from  {/}  to  {j} 
in  the  state  transition  diagram}, 


3.2  What  if  Some  Viruses  Are  More  Powerful  than 
Others? 

Suppose  the  set  of  viruses  J  can  be  partitioned  into  two 
nonempty  sets:  J\,  the  set  of  weaker  viruses,  and  the  set 
of  stronger  viruses,  such  that  in  the  state  transition  diagram, 
there  are  no  arcs  starting  at  any  {/},  £  E  J 2,  and  arriving  at 
some  {j},  j  G  J\.  In  other  words,  a  state  transition  diagram, 
after  omitting  the  susceptible  state,  possesses  the  "weaker- 
stronger"  structure  depicted  in  Fig.  3a.  Note  that  the  viruses 


Fig.  3.  J  may  be  partitioned,  based  on  their  robbing  capabilities,  into  weaker  viruses  set  c  J  and  stronger  one  J 2  =  J\J\. 
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belonging  to  3\  may  be  able  to  rob  each  other,  but  are 
unable  to  rob  any  virus  belonging  to  3  2-  On  the  other  hand, 
viruses  belonging  to  3i  may  be  able  to  rob  some  viruses 
belonging  to  3  \,  and  may  additionally  rob  other  viruses 
belonging  to  J2  as  well. 

Although  there  may  be  many  ways  to  partition  J  while 
preserving  the  weaker-stronger  structure,  we  may  focus  on 
the  partitions  where  \Ji\  is  maximal  because  the  weaker 
viruses  may  die  before  the  stronger  ones  die.  However,  we 
should  be  aware  that  there  may  be  multiple  ways  of 
partitioning  while  maximizing  \J i\.  For  example,  in  the 
example  shown  in  Fig.  3b  with  J  =  {1, 2, 3, 4, 5},  virus  5  can 
be  robbed  by  viruses  1  and  2;  virus  1  can  be  robbed  by  virus  3; 
virus  2  can  be  robbed  by  virus  4.  In  this  example,  there  are  two 
ways  of  p  artitioning  while  maximizing  |  | :  =  {1,2, 3, 5} 
in  one  case  (within  red-color  curve)  and  3\  =  (1,  2, 4,  5}  in 
the  other  (within  blue-color  curve).  Therefore,  we  can  define 


JJ  =  max{j'  C  J  G  J' J  ej\3',{£} 

\J'\ 

pointing  to  {/}  in  the  state  transition  diagram}. 

The  following  results  concern  the  fate  of  the  viruses 
belonging  to  any  3  \  G  JJ  of  interest. 

Theorem  11  (sufficient  condition  for  the  dying  out  of 
weaker  viruses).  If  for  all  j  G  J\, 


Ai  < 


Pj 

v  +  Efe  j,  %  ’ 


(3.2) 


then  all  viruses  belonging  to  J 1  will  die  out  regardless  of  the 
number  of  nodes  initially  infected. 

Proof.  To  prove  the  theorem,  we  also  first  prove  the 
following  result:  If 


max{7?.] (z)  :  z  G  A(Ai(Fi  +  f?n)  -  Hi)}  <  0,  (3.3) 

then  all  viruses  belonging  to  J \  will  die  out  regardless  of 
the  degree  of  initial  infection.  Since  condition  (3.2) 
implies  the  condition  (3.3),  we  complete  the  proof. 

Now  we  prove  that  if  condition  (3.3)  holds,  then  all 
viruses  belonging  to  3i  will  die  out  regardless  of  the 
degree  of  initial  infection.  For  j  G  J\,  we  have 


fiv,j  (f)sv(f)  +  f  ( Ifiu.t  if) 

TeAUi) 


I  Pi  "F  I  V i(P) 

\  teA fautU)  J 

—  P  '  ovuiu.j  (f ) A  "F  f  '  0:Vni)i.  jiU  j  (t) ivj  (1 ) 

ueV  u€V,(€A 


—  'y  aVuiu,j(i)y-j  T  ^  ^  '  Ovu'b ijiujff)  Pjvjiff 

u€V  u€V  leji 


because  =  0  for  all  j  G  J\  and  £  J \.  Write  the 
robbing  matrix  R  as  R  =  where  R\  \  corresponds 

to  the  virus  subset  3  According  to  the  definition  of  3i, 
we  have  J?2i  =  0.  Let  Tj  and  Bi  be  submatrix  of  T  and  D 
corresponding  to  J\,  respectively.  Consider  the  following 
comparison  system  of  Xx (t)  =  [xVtJ(t) ]v€Vj€ji\  (t)  = 
AXi(t)(ri  +  i?n)  —  Xi(t)Bi.  We  have  <  xvj(t)  for 


all  v  G  V,  j  G  J and  t  >  0  when  they  take  the  same  initial 
value,  namely  ivj( 0)  =  xvj(0),  thanks  to  Lemma  3.  Let  A  = 
S~lJS  be  A's  Jordan  decomposition  and  Y\  =  SX\,  we 
havejfYi  =  JY\(Yi  +  Rn)  —  YiB\.  Reasoning  in  a  fashion 
similar  to  the  proof  of  Theorem  4,  we  can  conclude  that 
under  condition  (3.3),  max7?.e(A„(r1  +  Rn)  —  Bi)  <0  for 
all  v  G  V.  This  means  limt_(0C  Yi (t)  =  0.  Therefore, 
lim^oo  X\{t)  =  0  and  thus  lim^oo  i vj(t)  =  0  for  all  v  G  V 
and  j  G  Ji .  This  completes  the  proof.  □ 


Theorem  12  (sufficient  condition  for  the  dying  out  of 
weaker  viruses  in  the  case  of  light  initial  infection).  If 

V)  G  J \,  Ai  <  — ,  then  the  viruses  belonging  to  3 1  will  die  out 
in  thecaseofliglit  initial  infection  with  viruses  belonging  to  3 1- 

Proof.  Note  that  for  any  j  G  3\,  if  there  exists  I  with  1 >  0, 
then  £  G  J Considering  only  the  viruses  belonging  to 
3\  in  (3.1),  and  ignoring  the  terms  of  order  higher  than  2, 
we  obtain  for  j  G  3 1 

—  iv  j (t)  —  ^  ovuyjiuj(t) sv (f)  +  ^  ^  d £jG,vuivfft)iuj)t) 

at  u€V  ueVMJi 

~  j  Pi  3  J 

y  t€zjf  out  (j)  J 

—  ^  ^  Q'vurYj'iu,j(t)  fij'i'vji.t)  • 

u€V 

Consider  the  following  comparison  system  for  j  E 
iVv,j(t)  =  Jfu€Vavu'yjyuJ(t)  -  PjyvJ(t).  Reasoning  in  a 
fashion  similar  to  the  proof  of  Theorem  6,  we  can  show 
that  under  the  given  condition  Ai  <  ,  lim^oc  Dv,j(f)  =  0. 

Therefore,  ivj(t)  =  0  for  j  G  3 1  are  locally  asymptotically 
stable.  We  complete  the  proof.  □ 

The  above  analytical  results  allow  us  to  draw: 

Insight  13.  When  it  is  impossible  to  kill  all  viruses,  Theorems  11 
and  12  offer  guidelines  for  quantitatively  tuning  parameters 
(as  mentioned  above)  to  kill  the  weaker  viruses. 

Because  the  weaker-stronger  structure  is  not  based  on 
the  curing-capabilities,  but  rather  based  on  the  robbing- 
capabilities,  of  the  viruses,  this  offers  an  extra  leverage  to 
the  defender: 

Insight  14.  Even  if  we  are  unable  to  directly  kill  the  weaker 
viruses,  we  may  exploit  the  stronger  viruses  to  help  alleviate 
the  weaker  viruses  when  the  stronger  viruses  are  actually 
easier  to  kill. 


3.3  What  if  the  Viruses  Are  (Approximately)  Equally 
Powerful? 

Here  we  address  the  question:  What  if  the  viruses  are 
(approximately)  equally  powerful? 

Theorem  15  (outcome  of  viruses  being  equally  powerful). 

Suppose  for  all  j,  j'  G  3  with  j  f=-  j' ,  there  exist  (3,  7,  i9  such 
that  (3j  =  fly  =  /3,  7 j  =  7 f  =  7,  and  djj  =  dfj  =  0.  If 


Ai  < 


P 

7  +  0  ’ 


(3.4) 


then  Vi>  G  V  and  j,j'  e  3 ,  j  f  /,  we  have  lim^oo  | ivj(t)  — 
1  =  0. 
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Proof.  Note  that 


Therefore,  we  have 


"h  ^  ' [Py.tj (f )  Py/.f  (^)]L>.L  fv,/(^)) 

(€j 

~  E 0vjAt)[ivj(t)  -  iv, ,/(*)]• 
e€j 

Since  1 6vj(t)  -  6vj(t)  \  <  Jf,ueV  avv\iu,j{t)  -  iuj{t)\l  and 
1 9v,ej(t)  ~  Ov,t,f(t) |  <  Y,uevavu‘V\iuAt)  ~  *«,/(*)  I'  we  have 

,,  l^uj(f)  (i)  |  7:  (7  "h  $)  ^  |  iuj  (t)  iuj(t)  | 

ueV 

i  |  iy.j  (  0  fu,/(f)|- 

Consider  the  following  comparison  system:  Ttx(i)  = 
[—/?/„  +  (7  +  i9)^4]rc(t),  where  In  is  the  n  x  n  identity 
matrix,  and  A  is  the  adjacency  matrix.  Reasoning  in  a 
fashion  similar  to  the  proof  of  Theorem  9,  we  see  that 
condition  (3.4)  implies  that  the  comparison  system  is 
asymptotically  stable,  and  thus  limf_,00|i0(j(f)  —  ivj(t)\  = 

0  Vi>  e  v,  V),  /  e  j.  □ 


Theorem  16  (outcome  of  viruses  being  approximately 
equally  powerful).  Suppose  for  all  £,  j,  (! ,  j'  £  J  with  j  f  l 
and  /  f  £' ,  there  exists  some  e  >  0  such  that  \/3j  —  (3e\  <  e, 
It j  -  7t|  <  e,  and  |i%  -  tV,y|  <  e.  If 


Ai  < 


ft 

7j  +  %  ’ 


(3.5) 


holds  for  some  £,j,  £  f  j,  then  there  exists  d*  independent  of  e, 
such  that  lim^ool^  j(t)  —  ivj(t)  |  <  d*e  for  all  v  £V  and  j, 
/  G  J. 

Proof.  Note  that,  according  to  (3.1) 


"L  ^  '  (PyJ:.j  (1)  Py/J'  (^))  iv/A)  ^  Py.jJ'(P)  (1 (1)  *„,/  (1)) 

teJ  £eJ 

—  Pj(ivj(t)  —  ivj(t ))  +  (/3y  —  )■ 

Note  also  that 


< 


it  (1  Tjhij'W)  ii  (-*- 

(u,v)€.E  ( u,v)€E 


+ 

< 


n  n-  TyWW)  n  n  tiWW) 

(u,v)€E  (u,v)&E 

7j  E  &VU  |  (^)  E  E  ^vu  h  -  7/ 


E  Tj  ^  ^  &VU.  |  j  (^)  V/(^)|  T-  dc, 


u€V 

where  d  is  the  largest  node  (in-)degree,  namely 

d  =  maxv{|{u  :  (w,v)  G  -£7} | }.  Similarly 


<  |<5„y(f)  -  <S„,y(f)|  +  y^|6>„yy(t)  -  0„y,y(f)| 

t€j 

—  _  w(^)i _  /?j|vj(^)  _  w(^)i 

i€j 

T"  |/3y  i  7:  (7?  T  ^ Cj)  Ea™l*uj(i) 

U€V 

—  Pj\^vj{t)  —  ivjif)  |  +  (|7T|  +  2)de. 

Let  1  =  [1, 1, ... ,  1]T.  Consider  the  following  comparison 
system: 

^x(t)  =  [— /3jl„  +  (7 ,•  +  ^j)^]x(<)  +  d'el,  (3.6) 

for  some  given  £,j  £  J,£  f  j,  and  d!  =  d(|  J\  +  2)  which  is 
clearly  independent  of  e.  Let  W3  =  —[3jln  +  (7 j  +  i?^)A. 
From  condition  (3.5),  we  see  that  —  Wj  is  an  AL -matrix,  and 
thus  Lemma  1  says  that  there  exists  positive  definite 
diagonal  matrix  Zj  such  that  ZfWj  +  Wj  Zj  is  negative 
definite.  Let  p™  and  gf  be  the  smallest  (in  modulus) 
eigenvalues  of  l/2(ZjWj  +  Wj Zf  and  Zj,  respectively. 
Then,  we  have 


dt 


—  xT(t)Zjx(t)  =  xT(t)[ZjWj  +  Wj  Zj]x(t)  +  2d'exT(t)l 
<  2 p™xT (t)x(t)  +  exT(t)x(t)  +  —  6 -  n 


< 


2 p™  +  £  T  ( d'e)2n 

11  ry  11  a  (t)ZjX(t)  +  — — , 


for  all  —2 pf  <  e  <  0.  Setting  e  =  —pf  and  using 
GronwalTs  inequality  [14],  we  have 


xT  (t)Zjx(t)  <  exp 


2  pf  +  £ 


■7112 


(d’efn  11^-11, 

£  -2 pf  -  £ 


t  )xT  (0)Zjx(0) 


1  —  exp 


2  pf  +  £ 

Tz 


Jll2 


<  exp(  jt^t rt  jxT(0)Z3x(0) 


Jll2 


(d'efWZ^n 

(Pf)* 


1  —  exp 


Pj 


\Zi\ 


Therefore,  we  have 


lim^oolla:] 


II2  -  IE)  xT(t)ZjX(t)  ^ 

||2  s  iim^oc - — - s 


(d!e)\\\Zj\\2 
(pf)2  p? 


One  can  see  that  condition  (3.5)  implies  that  the  compar¬ 
ison  system  (3.6)  satisfies  limf^00||a:(<)||2  <  d*e  for  some  d* 
independently  of  e.  This  leads  to  lim(^0O|i,!J(t)  —  ivj(t)  |  < 
d*e  for  all  v  £  Z,  where  d*  can  be  estimated  as 


d*  = 


d! 

E 


@v,£,j'(t)  |  E  ^  ^  Q'vu$£,j  | H- de. 

u€V 


This  completes  the  proof.  □ 

The  above  analytical  results  allow  us  to  draw: 
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Insight  17.  When  the  viruses  can  rob  each  other,  if  the  viruses 
1)  have  (approximately)  equal  attack  power  and  (approxi¬ 
mately)  equal  robbing  power  and  2)  are  (approximately) 
equally  defended  against,  then  they  will  have  (approximately) 
the  same  fate,  no  matter  how  many  nodes  they  initially  infect 
and  no  matter  they  ivill  die  out  or  not.  Moreover,  if  they  die 
out,  they  will  die  out  (asymptotically)  at  the  same  time. 

4  Deeper  Results  in  the  Case  of  Nondisjoint 
Spreading 

Now  we  consider  the  other  special  case,  where  one 
computer  gets  incrementally  infected  (i.e.,  by  one  virus 
after  another  rather  than  concurrently)  and  the  viruses  on  a 
computer  get  sequentially  (rather  than  simultaneously) 
cured.  We  use  capital  letters  such  as  I  (/  C  J)  to  represent 
states  in  the  state  transition  diagram,  and  simplify  other 
parameters  as  follows: 

1.  The  infection-map  I  in  the  general  model  is  now 
simplified  as  binary  relation  (I,  j)  £  I  where  j  £  J 
and  j  £  I,  meaning  that  a  node's  state  changes  from 
I  to  I U  {;}  because  of  infection  by  a  {j}-infected 
node.  We  hereby  denote  the  infection  probability  by 
t)f  j.  The  probability  of  node  v's  state  changes  from  I 
to  I U  {j}  is  simplified  as 

=  1  —  (1  — 

(u,v)€E 

(4.1) 

2.  The  recovery-map  R  in  the  general  model  is 
simplified  as  a  binary  relation  (I,  j)  £  R,  where 
j  £  I,  meaning  that  a  node's  state  changes  from  I  to 
I  \  {j}  because  of  curing.  We  hereby  denote  the  cure 
probability  by  /3jj. 

The  state  transition  diagram  in  Fig.  1  is  simplified  as  the 
one  in  Fig.  4.  The  master  equation  (2.2)  is  now  simplified  as 

yrj  iv.l  ( t)  —  ^  '  bv.J .idv.J  if)~F  f  ^  @K,£ i v,K  if ) 

(J,fc)el:.7|j«=l  (K/)eR:K\{(}=l 

(I,k)el  (I,£)€R 

(4.2) 

where  v  G  V,  I  C  J,  and  7/0. 

4.1  More  Succinct  Sufficient  Conditions  for  the 
Dying  Out  of  All  Viruses 

Because  the  present  model  is  a  special  case  of  the  general 
model,  we  can  immediately  obtain  the  following  results  as 
corollaries  of  5  and  7,  respectively. 

Corollary  18  (sufficient  condition  for  all  viruses  to  die  out 
regardless  of  the  degree  of  initial  infection).  If  VI  <Z  J 

where 

q  ^  0  Ai  <  S(J,j)6 RAj  ~  E(JJ)£R:J\{j}=J  Pjj 
*Ej(K,k)el:K  (J{/fc}=/  ^K’k 

where  J,K  C  J  and  j ,  k  £  J,  then  all  viruses  will  die  out  no 
matter  how  many  nodes  are  initially  infected. 


1  Aa.i  ^v,{i},2^) 


Fig.  4.  Simplified  state  transition  diagram  of  two  viruses  with  nondisjoint 
spreading. 

Corollary  19  (sufficient  condition  for  all  viruses  to  die  out 
in  the  case  of  light  initial  infection).  All  viruses  will  die 
out  in  the  presence  of  light  initial  infection  if:  1)  for  all  J  = 
{j}  where  j  £  J ,  it  holds  that 

\  „  %1j  _  Pjj 

M  s - > 

and  2)  for  all  I  C  J  with  |/|  >  1,  it  holds  that 

Phi  ~  >  °- 

Ud)e  R  (Ti)e 

Note  that  in  the  latter  case  j  =  0  because  |T|  >  1. 

Insight  8  is  also  naturally  inherited  by  the  above  results. 

4.2  Under  What  Conditions  Some  Viruses  Die  Out? 

Now  we  present  sufficient  conditions,  under  which  a  subset 
of  viruses  J  C  J  will  die  out. 

Theorem  20  (sufficient  condition  for  the  dying  out  of 
some  viruses  no  matter  how  many  nodes  they  initially 
infected).  For  a  set  of  viruses  J  C  J,  if  for  every  I  C  J  with 
i  n  J  0 

yy  Pij  ^  y^  Pj'j + 

id  ^ 

^  idK.k  +  d  ^2  dx,k, 

k£j,Ku{k}=I  ky.J,Ku{k}=I 

where  d  is  the  maximum  node  (in-)degree  and  k  K,  then  all 
viruses  belonging  to  J  zvill  die  out  no  matter  how  many  nodes 
are  initially  infected. 

Proof.  Note  that,  according  to  (4.1)  for  each  K  C  J  and 
k  £  J  with  k  £  K,  we  have 

t>v,K,k  (tMt)  <  £  &VU  (j^)^u,{k}  (^)  • 

u£V 

Furthermore,  for  each  K  c  I  and  K  U  {k}  —  I,  we  can 
make  the  following  estimations: 
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'  O.vu:0  K jAu.{ k'fdj  •,  k  £  J, 
ueV 

d'&K^Kit),  k(£  J. 

Let  X(t)  =  [. XvJ(t)\  with  v  £  V  and  If]  J  f  0.  Let  T  = 
[7 ki\  with  K,  I  £  {L  C  J  :  L  n  J  f  0},  where 


r  <  dvtiKtkiv,K{t),  k  (£  J, TVp  J  ±  0, 

^v,Kj(t)iv,K(t)  S  —  avu$K,jiu,{k}(t)i  k  £  7VP  J  =  0, 

I  ueV 

l.  =  0,  otherwise. 

Let  Z(t)  =  [Zt,  /(t)]  with  v  G  TV  and  I  fl  J  7^  0.  Let  f  = 
[7av]  with  K,I  £  {L  £  J  :  in  J  /  0},  where 


7av  = 


1?AW,*»  K  =  {fc}  ^  7  n  J,  (7  \  {fc},  fc)  G  I, 
0,  otherwise. 


Let  A  =  [Aaj]  with  K ,  7  G  (L  C  J  :  L  fl  J  ^  0},  where 


A  ki  = 


■&K :,k,  k<fJ,k<£K,  K  U  {k}  =  7,  (K,  k )  G  I, 
0,  otherwise. 


' )KI  = 


Consider  the  following  comparison  system  of  the 
master  equation  (4.2) 


dK’,k,  K={k}  C  I,  K'=I  \{k},K'  D  J=9,  (• K fc)  Gl, 
0,  otherwise. 


Let  A  =  diag[Ax/]  with  K,  I  G  {L  C  J  :  L  fl  J  ^  0},  where 
'0*,*,  77U{fc}  =  7,  k  £  J,  (/V,  fc)  G  I, 


Ar,-r  = 


0, 


otherwise. 


Define  a  comparison  system  to  the  linearized  system 
near  ivj(0)  =  0  for  all  v  G  V  and  1 1~)  J  7^  0 


dt. 


Z(t)  =  AZ(t)f  +  dZ(t) A  +  Z(t)B  -  Z(t)B\ 


jX{t )  =  AX(t)T  +  cLY(f)A  +  X(t)B  -  X(t)B \ 

with  B  =  [bxi]  and  B1  =  diag[&^],  where  K,I  G  {L  £  J  : 
Lfl  J/0}  and 


bni  = 


Pkj, 

0, 


/V  =  TU{i},(L,i)GR, 
otherwise , 


and  b'j  =  /T/j. 

(i,i)e  R 


It  can  be  seen  that  Xvj(t)  >  ivj(t)  for  all  t  >  0,  n  G  TV,  and 
Tn  J/0  as  long  as  .^/(O)  >  M(0).  Consider  the 
stability  of  the  following  component  system: 

jty(t)  =  y(t)[\1r  +  dA  +  B-B'], 

with  y(t)  being  /t-dimensional  row  vector.  Similar  to  the 
proof  of  Corollary  5,  we  conclude  that  condition  (4.3) 
implies  that  the  real  parts  of  the  eigenvalues  of  matrix 
AiT  +  dA  +  B  —  B'  are  all  negative.  According  to  Lem¬ 
ma  2,  we  have  lim^,*,  y(t)  =  0.  By  the  same  fashion  of 
reasoning  as  in  the  proof  of  Theorem  4,  we  have 
lini/^oo  X(t)  =  0.  According  to  Lemma  3,  we  have 
lini/^oo ivy(t)  =  0  for  all  If) .7^0.  This  completes  the 
proof.  □ 


Theorem  21  (sufficient  condition  for  the  dying  out  of 
some  viruses  that  did  not  infect  many  nodes  initially). 

For  a  set  of  viruses  J  £  J,  if  for  each  I  with  I  f)J  $ 

y  Phj  >  y  Pj'j + ai 

J€!  J'\{f}=I 

y  $K,k + d  y  ^K,k, 

keJ.K  p|  J=9,KU{k}=I  k<£J,KU{k}=I 

then  all  viruses  belonging  to  J  will  die  out  when  they  did  not 
initially  infect  many  nodes. 

Proof.  Note  that  a  linearization  of  the  master  equation  (4.2) 
near  ?'„_/( 0)  =  0  for  all  v  G  TV  and  I  p  J  f  0,  neglecting  all 
terms  with  order  higher  than  1,  leads  to 


where  B  =  [bKi\  and  B'  =  diag[6^]  with 


bni 


@K,j,  K  —  I  \J  {j},  (K,j)  G  R,  /  n 

Or  —  y  ,j 

0,  otherwise ,  —~f 


By  the  same  fashion  of  reasoning  as  in  the  proof  of 
Theorem  20,  we  can  conclude  the  local  stability  of  ivy  =  0 
for  all  v  G  TV  and  7p  J  f  0.  □ 

The  above  analytical  results  allow  us  to  draw: 

Insight  22.  When  the  goal  is  to  kill  a  subset  of  viruses  (e.g.,  the 
damaging  ones),  Theorems  20  and  21  offer  guidance  for 
adjusting  the  defense  (as  mentioned  above)  toivard  the  goal. 


4.3  What  if  the  Viruses  Are  (Approximately)  Equally 
Powerful? 

When  viruses  can  coreside  on  computers,  it  is  much  more 
challenging  to  analyze  the  outcome  when  the  viruses  are 
(approximately)  equally  powerful.  Indeed,  we  are  only  able 
to  analyze  this  for  the  special  case  of  |  J\  =  2  with  respect  to 
the  state  transition  diagram  shown  in  Fig.  4. 

Theorem  23  (outcome  of  viruses  being  equally  powerful). 

Suppose  %1  =  %2,  tf{1},2  =  1?{2},1,  P{1},1  =  /3{2},2,  and 
P{1,2},1  =  /?{1,2},2-  If  Ai  <  ^)}^;1}2,  then  lim^  \iVt{1}{t)  - 
iv,{2}(f)\  =  0  for  all  v  G  V. 

Proof.  From  the  simplified  master  equation  (4.2),  we  have 


dtXv’{2}^  ~ 


!-  n  (! -%iv{i>w) 

.  {u,v)eE  ! 


- !-  n  (i_,,{)At2 iW) 

\  {u,v)eE  / 

^v,{}(t)  ~~  ^v,{2 }(£)) 


1  -  n  a- ^{ii,2V{2}(f))  mi }(t) 

K  (■ u,v)eE  J 

-I1-  n  ^-^ilAwW)  V{2}W 

\  {u,v)€E  ) 
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Noting 


Proof.  From  the  simplified  master  equation  (4.2),  we  have 


and 


1  -  n  (!  -%iV{i}(f)) 

\  ( u,v)€E  ) 

~(x~  n  (!-%2V2}(f)) 

\  {u,v)eE  y 

—  ^  ^  QJvu\i'u,{  1}  (^)  —  ^u,{2}(^)|^{},1’ 
u£V 


1  -  n  (!- ^{1}.2V{2}(<))  V{i}(i) 

,  (u,v)eE  ) 


- 1  -  n  “  ^{i},2V{i}(f))  vi2}(f) 

V  (u,v)eE  J 


<  -  1  -  n  (!- ^{l},2*u,{2}(f))  (V{l}(f)  -  iv,{ 

\  (u,v)eE  ) 

+  (l_  H  (1_%,2V{2}W)| 

V  (u,v)€E  ) 


~  !-  n  (!  -  0{i},2V{i}(f)) 
V  ( u,v)eE  L 


*«,{2}(f) 


<  -  1  -  n  a- ^{1}.2*»,{2  }(*))  (iv,{  l}(t)  -  V{2  }(<)) 

\  (u,v)€E  J 

4“  ^  ^  Q'vu\i’u,{l}  (f)  A, {2}  (f )  |^{1}, 2> 

u€V 

we  have 

^  |v{l}(^)  —  *0,{2}(f)|  <  — /^{},llv{l}(l')  —  V{2}(f)| 

+  flwj|*u,{l}(f)  —  *u,{2}(f)|(^{},l  +  T^{1 }  ,2 )  - 
ueV 

Consider  a  comparison  system 

;Zv{t)  =  —/3{},i  zv(t)  +  ^aTO2:„(<)(^{}il  +  d{i}!2). 


dt 


U€V 


One  can  see  that  zv(t)  >  Ivpjff)  —  iv,{2}(t)\  holds  for  all 
t  >  0  and  v  £  N  as  long  as  zv( 0)  >  |4,{i}(0)  —  *«,{2}(0)|- 
The  given  condition  implies  that  the  real  parts  of  all 
eigenvalues  of  —  /%,i4i  +  (${i},2  +  ${},i)A  are  negative. 
According  to  Lemma  2,  limt_,00^(f)  =  0  for  all  v  £  V. 
This  implies  lim^x,  |i„,{i}(f)  —  *„,{2}(f)|  =  0  for  all  v  £  V. 
The  proof  is  completed.  □ 

Theorem  24  (outcome  of  viruses  being  approximately 
equally  powerful).  Suppose  there  is  e  >  0  such  that 
l%i  -  ^{},2|  <  e,  |${i},2  -^{2},i|  <  L  |/3{i},i  -  /3{2},2|  <  L 
and  |/3{1,2},r  -  /?{i,2},  2 1  <  £■  If  M  then 

lim^oo |it,, {!}(<)  —  iv,{2}(t)\  <  d'efor  some  d!  >  0  independent 
of  e  and  all  v  £  V. 


^Mi}W-^V{2}(f) 


dt 


f-  n  (!  -^{i,iv{i}W) ) 

(u,v)gE  / 


1  1  I  (l-%2A,{2}(f)) 

\  (u,v)eE  J  _ 

—  /3{i},i(V{i}(f)  —  A, {2  }(f))  —  (/3{i},i  —  /3{2},2)V{2}(f) 

1  -  n  a- ^{i}.2 via) (*)) ) v{i) w 

(u,v)€E  / 


-  !-  n  (!  -  %,1*»,11}W)  ]V{2}(f) 

\  ( U,v)€E 

+  (/?{!, 2}, 2  —  /3{l,2},l)*t),{l,2  }(t)- 


Noting 

fi_  n  (i-%iv{i}(*))l 

\  {u,v)eE  / 

~fi_  n  (iMi?{},2V{2}(f)) 

\  (u,v)eE  J 

—  ^  ^  ^vu  Ku,{l|  (^)  ^u,{  2}(^)|^{},! 


ueV 


+ 


n  (!  -  %iv{2}W)  -  n  i1  -  %2M2}W) 

(u,v)£E  ( u,v)€.E 


—  ^  ^  Q,vu\'i'u,{l}  (t)  ^u,{2}  (^)  ^  ^  Q'vu'i,u,{2}  (^)  I^{},1  ^{},2| 

u,eV  u 

—  ^  ^vu (^)  ^u,{2}  (^)  |^{},i  “t-  da. 


u&V 

and 


1  -  n  (!- ^{1}.2V{2} (f))  V{l}(f) 

,  ( u,v)€E  ) 


- 1  -  n  t1  -  ^{ii,2V{i}(*))  vi2}(f) 

V  (u,v)eE  ) 


<  -  1  -  n  (r“  ^{l},2*M,{2}(f))  (V{1}(1)  -  4,{2}(f)) 

•  (  1  ~  H  (1  -  1?{2},l*u,{2}(f))  J 

V  (u,v)£E  ) 


~  !-  n  (!  -  ^{i},2V{i}(f)) 


(u,v)€E 


iv,{2}(t) 


<  -  1  -  n  _  ^{l},2*M,{2}(f))  (iy,{l}(t)  -  iv,{ 2}{t)) 

\  ( u,v)£E  J 

+  fljiu|*M,{i}(f)  —  *ii,{2}(f)|^{i},2  +  de, 


ueV 
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(a)  disjoint  spreading  (b)  independent  spreading 

Fig.  5.  State  transition  diagrams  of  node  v  eV  with  \  J\  =  2:  disjoint  spreading  versus  independent  spreading. 


we  have 

^|V{i}W  —  V{2}(f)  —  —  /3{},i|*n,{t}(^)  *^,{2}  (^-)  I 

+  Oiro|*u,{l}(i)  _  hi,{2}(f)|(${},l  +  ${1},2)  +  d'e , 

U€V 

where  d!  =  2d  +  3.  Via  Lemma  1,  the  given  condition 
implies  — /3pj  /„  +  (t?{i},2  +  #{},i)A  is  an  M-matrix.  The 
theorem  can  be  proved  in  the  same  fashion  as  in  the 
proof  of  Theorem  16.  □ 

The  above  analytical  results  allow  us  to  draw: 

Insight  25.  If  two  viruses  have  (approximately)  equal  infection 
power  and  are  (approximately)  equally  defended  against,  then 
they  will  have  (approximately)  the  same  fate,  no  matter  how 
many  nodes  they  initially  infect  and  no  matter  they  will  die  out 
or  not.  Moreover,  if  they  die  out,  they  will  die  out  (asympto¬ 
tically)  at  the  same  time. 

5  Further  Insights  Offered  by  the 
Theoretical  Results 

First,  by  comparing  Corollary  5  and  Corollary  7,  we  draw: 

Insight  26.  The  (reactive)  countermeasures,  which  are  sufficient 
for  killing  the  spreading  of  multiple  viruses  when  some  (but 
not  many)  nodes  are  initially  infected,  might  not  be  sufficient 
for  killing  the  spreading  of  multiple  viruses  when  many  nodes 
are  initially  infected.  Therefore,  preventive/proactive  defense 
and  rapid  attack  detection/response  are  always  very  valuable. 

Second,  consider  a  special  case  of  the  model  investigated 
in  Section  3,  namely  that  the  viruses  spread  disjointly  but 
without  being  able  to  rob  each  other,  meaning  t>ij  =  0  for 
l.  j  e  J.  Then,  the  state  transition  diagram  in  Fig.  1  is  now 
simplified  as  Fig.  5a,  and  the  sufficient  condition  given  in 
Corollary  9  becomes  Ai  <  One  the  other  hand, 

recall  the  sufficient  condition  (due  to  [10])  for  the  dying  out 
of  a  single  virus  spreading  in  an  arbitrary  network  is  Ai  <  ^ . 
As  a  trivial  extension,  this  means  that  when  multiple 
viruses  spread  independently  over  the  same  network,  the 
sufficient  condition  for  virus  j  to  die  out  is  <—.  For 
the  case  of  two  viruses,  the  state  transition  diagram  is 
Fig.  5b.  Also  considering  Corollaries  10  and  18,  the  above 
discussion  leads  to: 

Insight  27.  The  (reactive)  countermeasures,  which  are  sufficient 
for  killing  the  independent  spreading  of  viruses  when  many 
nodes  are  initially  infected,  are  1)  sufficient  for  killing  the 
disjoint  spreading  of  viruses  when  some  nodes  are  initially 


infected,  2)  probably  not  sufficient  for  killing  the  disjoint 
spreading  of  viruses  when  many  nodes  are  initially  infected, 
and  3)  probably  not  sufficient  for  killing  the  nondisjoint 
spreading  of  viruses  under  certain  circumstances  (e.g.,  /3{i},i  — 
/3{i,2},2  =  0  or  P{2},2  ~  =  0  when  there  are  two  viruses). 

The  above  1)  and  2)  characterize  the  power  disjoint  versus 
independent  spreading  in  terms  of  the  tolerable  number  of 
initially  infected  nodes.  It  may  seem  counterintuitive,  but  is 
actually  reasonable  because  the  sufficient  condition  deals 
with  the  dying  out  of  viruses,  and  nothing  else  (e.g.,  what 
happens  when  the  viruses  do  not  die  out).  Note  that  there  is 
no  contradiction  between  this  statement  and  Insight  14, 
where  we  claimed  that  the  stronger  viruses  could  be 
exploited  to  alleviate  (but  not  kill)  the  weaker  viruses  that 
are  actually  harder  for  the  defender  to  kill  directly. 

Third,  by  considering  Theorems  11  and  12,  and  further 
Corollaries  9  and  10,  we  draw: 

Insight  28.  When  some  viruses  can  be  robbed  by,  but  cannot  rob, 
the  others,  it  is  easier  to  kill  these  zveak  viruses  when  they 
initially  infected  some  nodes  than  zuhen  they  initially  infected 
many  nodes.  Moreover,  zveak  viruses  may  be  killed  without 
strong  viruses  being  killed. 

The  above  insight  cannot  be  directly  applied  to  the  case 
where  the  viruses  spread  nondisjointly  and  do  not  rob  each 
other.  Still,  by  comparing  Theorems  20  and  21,  we  draw: 

Insight  29.  When  the  viruses  do  not  rob  each  other  and  spread 
nondisjointly  (but  not  independently),  it  is  easier  to  kill  a 
subset  of  viruses  when  they  initially  infected  some  nodes  than 
when  they  initially  infected  many  nodes. 

6  Simulation  Study 

We  use  simulation  (under  the  same  model  assumption)  to 
confirm  the  analytical  results  in  the  disjoint  spreading 
model  (Section  3)  because  of  its  simplicity.  We  consider 
three  examples  of  G  =  (V,E):  Erdos-Renyi  random  graph, 
regular  graph,  and  power  law  graph.  In  each  case,  we  have 
|V|  =  2,000.  The  random  graph  has  average  node  degree 
6.001  and  Ai  =  7.1424.  The  regular  graph  has  node  degree  6 
and  Ai  =  6.  The  power  law  graph,  which  is  generated  using 
Brite  [23],  has  average  node  degree  5.994  and  Ai  =  11.6514. 
The  graphs  have  approximately  the  same  average  degree 
because  we  want  to  observe,  as  a  side  product,  whether 
there  is  any  difference  caused  by  topology.  We  simulated 
two  settings  with  \  J\  =  2  and  \  J\  =  3;  in  what  follows  we 
only  report  the  results  corresponding  to  \  J\  =  2  because  of 
space  limitation  (\J\  =  3  exhibits  similar  outcomes  though). 
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Fig.  6.  Simulation  confirmation  of  Corollary  9  with  two  viruses. 
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Fig.  7.  Simulation  confirmation  of  Corollary  10  with  two  viruses. 


(c)  powerlaw  graph 


Other  parameters  will  be  specified  below.  To  represent  the 
dynamics,  we  plot  the  number  of  nodes  infected  by  virus 
j  G  J  at  time  t  based  on  the  average  of  20  independent 
simulation  runs. 

6.1  Simulation  Study  for  Verifying  the  Sufficient 
Conditions  for  the  Dying  Out  of  Viruses 

Simulation  confirmation  of  Corollary  9.  We  let  each  virus 
initially  infect  400  or  20  percent  randomly  selected  nodes. 
For  each  topology,  we  use  the  following  three  parameter  set 
scenarios:  1)  (ft,  71,  ft,2)  =  (0.1,0.0005,0.001)  and  (ft, 72, 
ft,i)  =  (0.1,  0.00045,  0.002);  2)  (ft,  7l,  ft,2)  =  (0.2,  0.0025, 
0.001)  and  (ft,  72,  ft,i)  =  (0.15,  0.002,  0.002);  3)  (ft,  71, 
ft, 2)  =  (0.2,0.005,0.001)  and  (ft,  72,  ft,i)  =  (0.3,  0.0045, 
0.005).  In  each  parameter  scenario,  the  sufficient  condition 
stated  in  Corollary  9  is  satisfied,  and  thus  the  analytical 
result  says  that  the  two  viruses  will  die  out  regardless  of  the 
degree  of  initial  infection. 

Fig.  6  plots  (partially)  the  simulation  results,  and  shows 
that  the  three  graphs  demonstrate  similar  phenomenon. 
For  example,  both  viruses  spreading  over  the  regular 
graph  die  out  at  about  the  l,000th/4, 000th/ 12,000th  step  in 
parameter  scenarios  l)/2)/and  3),  respectively. 

Simulation  confirmation  of  Corollary  10.  We  let  each 
virus  initially  infect  75  or  3.75  percent  randomly  selected 
nodes.  For  each  topology,  we  consider  the  following  three 
parameter  set  scenarios:  1)  (ft, 71, ft,2)  =  (0.3,0.021,0.1) 
and  (ft,  72,  ft,i)  =  (0.3,  0.024,  0.2);  2)’  (ft, 71, ft. 2)  =  (0.5, 
0.031,  0.1)  and  (ft,  72,  ft,i)  =  (0.5,  0.034,  0.2);  3)  (ft,  71, 
ft, 2)  =  (0.6,0.024,0.1)  and  (ft, 72, ft, 1)  =  (0.6,0.048,0.2).  In 
each  parameter  scenario,  the  sufficient  condition  in  Cor¬ 
ollary  10  is  satisfied,  and  thus  the  viruses  should  die  out. 

Fig.  7  plots  (partially)  the  simulation  results,  which 
demonstrate  that  the  three  graphs  demonstrate  similar 
phenomenon.  For  example.  Fig.  7b  plots  the  dynamics  in 


the  regular  graph  and  shows  that  both  viruses  die  out  at 
about  the  l,200th/2, 800th/ 10,000th  step  in  the  parameter 
scenarios  l)/2)/3),  respectively.  Note  that  there  is  a  period 
of  time  at  the  beginning  stage,  during  which  the  two  viruses 
exhibit  different  characteristics.  This  is  mainly  caused  by 
the  fact  that  ft, 2  =  0.1  <  1/2,1  =  0.2,  meaning  that  virus  l's 
robbing  capability  is  stronger  than  virus  2's. 

Because  Corollary  10  hints  that  its  sufficient  condition  is 
close-to-being-necessary,  we  would  like  to  confirm  this 
property  as  well.  We  let  each  virus  initially  infect  75  or 
3.75  percent  randomly  selected  nodes,  respectively.  For 
each  topology,  we  consider  three  parameter  scenarios: 
1)  (ft,  7i,  ft, 2)  =  (0.6,  0.5,  0.1)  and  (ft, 72, ft, 1)  =  (0.6, 
0.048,0.2);  2)  (ft,  7l,  ft, 2)  =  (0.5, 0.3, 0.1)  and  (ft, 72, 
ft, 1)  =  (0.5,0.034,0.2);  3)  (ft, 71, ft, 2)  =  (0.3, 0.1, 0.1)  and 
(ft,  72,  ft, 1)  =  (0.3,0.024,0.1).  In  every  parameter  set 
scenario  of  every  case,  the  sufficient  condition  is  slightly 
violated  and  the  viruses  may  not  die  out.  Fig.  8  plots 
(partially)  the  simulation  results,  which  show  that  in  every 
scenario  of  every  case,  the  viruses  do  not  die  out. 

In  summary,  our  simulation  confirms  Corollaries  9  and 
10.  As  a  side-product,  our  simulation  study  suggests  that 
it  seems  more  difficult  to  kill  all  the  viruses  in  regular 
and  random  networks  than  in  power  law  networks.  It  is 
interesting  to  prove  this  analytically. 

6.2  Simulation  Study  for  Verifying  the  Outcome 
when  Some  Viruses  Are  More  Powerful 

Simulation  confirmation  of  Theorem  11.  We  let  virus  1 
initially  infect  320  or  16  percent  randomly  selected  nodes, 
and  virus  2  initially  infect  1,280  or  64  percent  randomly 
selected  nodes.  Fig.  9  plots  the  dynamics  of  the  two  viruses 
with  respect  to  the  following  three  parameter  scenarios  in 
each  case  of  system  graph  topologies:  1)  (ft, 71, 1/1,2)  = 
(0.4,0.02,0.2)  and  (ft, 72, ft,  1)  =  (0.001, 1.0, 0.0);  2)  ’(ft, 
7i, ft, 2)  =  (0.6,0.02,0.3)  and  (ft, 72, ft,  1)  =  (0.002, 1.0, 0.0); 
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Fig.  8.  Simulation  confirmation  of  Corollary  10  with  two  viruses. 
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Fig.  10.  Simulation  confirmation  of  Theorem  12  with  two  viruses. 


(c)  powerlaw  graph 


3)  (ft,  Ti.01, 2)  =  (0.7,0.030.4)  and  (ft,  72, ft,i)  =  (0.002, 
1.0, 0.0).  Note  that  in  each  parameter  set  scenario,  virus  1 
is  the  weaker  one  (i.e.,  J  =  {1})  and  virus  2  is  the  stronger 
one,  and  the  parameters  corresponding  to  virus  1  satisfy 
the  sufficient  condition  in  Theorem  11,  which  means  that 
virus  1,  but  not  necessarily  virus  2,  will  die  out.  Note  also 
that  we  intentionally  select  the  parameters  for  virus  2  so 
that  the  sufficient  condition  in  Corollary  9  is  violated; 
otherwise,  virus  2  is  bound  to  die  out  as  well.  Fig.  9  plots 
part  of  the  simulation  results.  With  respect  to  the 
parameter  scenarios  l)/2)/3),  simulation  shows  that  in 
the  random  graph,  the  weaker  virus  dies  out  at  about  the 
800th/3,100th/8, 000th  step,  respectively;  in  the  regular 
graph,  the  weaker  virus  dies  out  at  about  the  790th/ 
3,100th/8,000th  step,  respectively;  in  the  power  law  graph, 
the  weaker  virus  dies  out  at  about  the  800th/ 2, 220th/ 
4,500th  step,  respectively.  Note  that  in  all  cases  virus  2  dies 
out  much  more  slowly,  which  does  not  contradict  with  the 
analytical  result. 

Simulation  confirmation  of  Theorem  12.  We  let  virus  1 
initially  infect  80  or  4  percent  randomly  selected  nodes,  and 
virus  2  initially  infect  320  or  16  percent  randomly  selected 


nodes.  Fig.  10  plots  (partially)  the  dynamics  of  two  viruses 
with  the  following  three  parameter  scenarios  in  each  case  of 
system  graph  topologies:  1)  (ft,  71,  $1,2)  =  (0.3,0.021,0.1) 
and  (/32,72,1V)  =  (0-°08, 1.0, 0.0);  2)  (ft, 7l, ft,2)  =  (0.4, 
0.032,0.4)  and  (ft, 74, ft, 1)  =  (0.009, 1.0, 0.0);  3)  (ft, 71, 
ft, 2)  =  (0.6,0.042,0.4)  and  (ft, 72, ft,  1)  =  (0.015,1.0,0.0).  In 
each  parameter  scenario,  virus  1  is  the  weaker  one  and  virus 
2  is  the  stronger  one,  and  the  parameters  corresponding  to 
virus  1  satisfy  the  sufficient  condition  in  Theorem  12,  which 
means  that  virus  1,  but  not  necessarily  virus  2,  will  die  out. 
With  respect  to  the  parameter  scenarios  l)/2)/3),  simula¬ 
tion  shows  that  in  the  random  graph,  the  weaker  virus  dies 
out  at  about  the  l,500th/2, 400th/ 12,000th  step,  respectively; 
in  the  regular  graph,  the  weaker  virus  dies  out  at  about  the 
l,300th/2,300th/ll, 400th  step,  respectively;  in  the  power 
law  graph,  the  weaker  virus  dies  out  at  about  the  800th/ 
l,100th/3, 900th  step,  respectively. 

In  summary,  our  simulation  study  confirms  the  sufficient 
conditions  given  in  Theorems  11  and  12,  under  which  the 
weaker,  but  not  necessarily  the  stronger,  viruses  die  out. 
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Fig.  11.  Simulation  confirmation  of  Theorem  15  with  two  viruses. 
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6.3  Simulation  Study  for  Verifying  the  Outcome  of 
Viruses  Being  Equally  Powerful 

Fig.  11  plots  (partially)  the  dynamics  of  two  viruses  with 
three  parameter  scenarios:  1)  (ft, 71, $1,2)  =  (0.2,0.0018, 
0.002)  and  (ft,  72,  ftp)  =  (0.2,  0.0018,  0.002);  2)  (ft,  7l, 
tli, 2)  =  (0.4,  0.0022,  0.002)  and  (ft,  72,  ft,i)  =  (0.4,  0.0022, 
0.002);  3)  (ft,  71,  ft, 2)  =  (0.6,  0.0036,  0.002)  and  (ft,72, 
ft,i)  =  (0.6,0.0036,0.002).  All  the  parameter  scenarios  sa¬ 
tisfy  the  sufficient  condition  in  Corollary  9,  meaning  that 
the  two  viruses  will  die  out,  and  the  sufficient  condition  in 
Theorem  15,  meaning  that  the  two  viruses  will  exhibit  the 
same  dynamics.  We  let  each  of  them  initially  infect  400  or 
20  percent  randomly  selected  nodes.  Fig.  11  demonstrates 
that  for  each  graph  in  each  of  the  parameter  scenario,  the 
two  viruses  exhibit  almost  identical  dynamics. 

Fig.  12  plots  (partially)  the  dynamics  of  two  viruses  with 
the  following  three  parameter  scenarios:  1)  (ft, 71,  ft, 2)  = 
(0.4,0.005,0.02)  and  (ft,  72,  ft,  i)  =  (0.4, 0.005, 0.02);  2) 
(ft, 71^1,2)  =  (0.6, 0.0075,  0.03)  and  (ft, 72, ft, 1)  =  (0.6, 
0.0075,  0.03);  3)  (ft, 71,  ft, 2)  =  (0.8, 0.01, 0.04)  and  (ft, 72, 
ft,i)  =  (0-8, 0.01, 0.04).  All  of  the  three  parameter  sets  do  not 
satisfy  the  sufficient  condition  in  Corollary  9,  meaning  that 
the  two  viruses  may  not  die  out,  but  satisfy  the  sufficient 
condition  in  Theorem  15,  meaning  that  the  two  viruses  will 
exhibit  the  same  dynamics.  We  let  each  of  them  initially 
infect  400  or  20  percent  randomly  selected  nodes.  Fig.  12 
demonstrates  that  for  each  graph  in  each  of  the  parameter 
scenario,  the  two  viruses  exhibit  quite  identical  dynamics. 
The  small,  but  sometimes  noticeable  difference  is  attributed 
to  the  fact  that  the  curves  are  based  on  the  average  of  20 
runs  and  the  viruses  dying  out  very  slow  (in  which  case 
oscillation  is  possible). 

In  summary,  out  simulation  study  confirms  the  analy¬ 
tical  result  of  Theorem  15,  namely  that  if  the  viruses  are 


equally  powerful,  then  they  will  eventually  exhibit  the  same 
dynamics.  In  particular,  if  they  are  to  die  out,  they  will  die 
out  at  the  same  time. 

7  Conclusion 

We  introduced  a  general  model  of  multivirus  spreading 
dynamics,  where  the  viruses  may  combat  against  each 
other.  We  presented  analytical  results  in  the  general  model, 
including  sufficient  conditions  under  which  the  viruses  die 
out  when  many  or  some  nodes  are  initially  infected,  and 
deeper  results  for  two  special  cases  of  the  general  model. 
The  analytical  results  made  a  fundamental  connection 
between  the  defense  capability  and  the  connectivity  of  the 
network  (via  the  largest  eigenvalue  of  its  adjacency  matrix), 
and  allowed  us  to  draw  various  insights  that  can  be  used  to 
guide  security  defense. 

This  paper  brings  a  range  of  new  problems  for  future 
research.  In  addition  to  those  mentioned  in  the  body  of 
the  paper,  we  offer:  When  we  state  the  degree  of  initial 
infection,  we  used  the  qualitative  term  'Tight  initial 
infection,"  but  can  we  quantify  it?  What  are  the  necessary 
conditions  under  which  the  viruses  will  die  out?  What  are 
the  complete  dynamics  characteristics  when  (some  or  all  of) 
the  viruses  do  not  die  out? 
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Long-Range  Dependence  as  Exhibited  by  Honeypot-Captured  Cyber  Attacks: 
The  Phenomenon,  Implication  and  Mysterious  Cause 


Abstract — In  this  paper  we  report  the  Long-Range  De¬ 
pendence  (LRD)  phenomenon  that  is  exhibited  by  honeypot- 
captured  cyber  attacks.  To  our  knowledge,  LRD  was  not  known 
to  be  relevant  in  the  cyber  security  domain  until  now,  although 
it  has  been  known  for  two  decades  that  LRD  is  exhibited 
by  benign  traffic  (i.e.,  no  attacks).  We  demonstrate  the  cyber 
security  implications  of  the  LRD  phenomenon,  by  showing  how 
to  predict  the  rates  of  incoming  attacks  with  reasonable  good 
precision,  especially  at  the  scale  of  “aggregation”  the  victims 
as  a  network.  This  implies  the  possibility  of  providing  real- 
life  defenders  with  sufficient  early-warning  time  for  adjusting 
their  defense  configurations  (e.g.,  proactively  allocating  more 
resources  for  deep  packet  inspection).  We  also  explore  the  cause 
of  the  LRD  phenomenon.  We  find  that  the  theory,  which  was 
developed  for  explaining  the  possible  cause  of  LRD  in  benign 
traffic,  is  probably  not  applicable  to  the  LRD  observed  in  the 
cyber  security  domain.  We  further  find  that  LRD  observed  in 
the  cyber  security  domain  is  unlikely  caused  by  the  intensity 
of  attacks  that  are  launched  by  individual  attackers. 

Keywords -Cyber  security,  long-range  dependence  (LRD), 
attack  prediction,  stochastic  cyber  attack  processes,  attack 
inter-arrival  time,  attack  rate 

I.  Introduction 

This  paper  studies  statistical  phenomena/properties  of 
cyber  attacks  as  exhibited  by  honeypot-captured  attack  data. 
Studies  of  this  kind  are  important  because  they  can  lead  to 
deeper  understanding  of  cyber  security  from  a  perspective 
that  is  different  from  most  existing  studies.  For  example,  the 
statistical  properties  may  have  important  real-life  implica¬ 
tions  that  can  be  exploited  for  more  effective  cyber  defense. 
Such  insights  often  cannot  be  obtained  from  studies  at  some 
lower  levels  of  abstractions. 

Our  Contributions 

In  this  paper,  we  make  two  contributions.  First,  we  introduce 
the  concept  of  stochastic  cyber  attack  processes,  which  are 
a  new  kind  of  mathematical  objects  for  modeling  certain 
cyber  attacks.  We  then  present  a  formal  statistical  method¬ 
ology  for  analyzing  them.  The  methodology  is  centered  on 
answering  three  questions:  (1)  What  phenomena/properties 
do  the  processes  exhibit  (e.g.,  are  they  Poisson)?  (2)  What 
are  the  implications  of  the  phenomena/properties  (e.g.,  can 
we  exploit  them  for  better  prediction  of  the  incoming  at¬ 
tacks)?  (3)  What  are  the  cause  of  the  observed  phenomena? 
Although  the  present  paper  only  deals  with  the  stochastic 
cyber  attack  processes  that  correspond  to  honeypot-captured 
cyber  attacks,  the  methodology  can  be  applied  to  analyze 
other  kinds  of  cyber  attack  data,  such  as  the  attacks  against 
(individual  computers  in)  production  network  —  if  such  data 
is,  fortunate  enough,  available. 


Second,  our  formal  statistical  analysis  of  honeypot- 
captured  data  (corresponding  to  166  IP  addresses  for  five 
periods  of  time,  or  220  days  accumulatively)  led  to  the 
following  findings.  (1)  The  stochastic  cyber  attack  processes 
are  not  Poisson  but  majority  of  the  processes  instead  exhibit 
Long-Range  Dependence  (LRD),  a  phenomenon  that  was 
not  known  to  be  relevant  in  the  cyber  security  domain 
until  now.  (2)  We  explore  how  to  exploit  the  presence  of 
LRD  to  conduct  more  accurate  predictions  for  the  attack 
processes  corresponding  to  individual  victims,  and  especially 
for  the  attack  processes  corresponding  to  the  victim  network 
(i.e.,  by  aggregating  the  relevant  victims  as  a  whole).  This 
shows  the  power  of  “gray-box”  (rather  than  “black-box”) 
prediction  because  the  prediction  models  can  accommodate 
the  LRD  phenomenon.  (3)  We  explored  the  cause  of  LRD 
as  exhibited  by  the  honeypot-captured  data.  We  confirm  that 
non-stationarity  of  processes  can  cause  “illegitimate”  LRD. 
However,  for  the  “legitimate”  LRD  (i.e.,  not  caused  by  non- 
stationarity  of  processes),  we  find  that  LRD  is  probably 
not  caused  by  that  the  underlying  stochastic  attack  (sub- 
)processes  exhibit  the  heavy-tail  phenomenon;  this  is  in 
contrast  to  the  setting  of  benign  traffic,  where  LRD  was 
believed  to  be  caused  by  the  superposition  of  processes 
that  exhibit  the  heavy-tail  phenomenon  [1],  We  further  find 
that  the  LRD  observed  here  was  probably  not  caused  by 
intense  (consecutive)  attacks  that  are  launched  by  individual 
attackers.  This  means  that  more  studies  are  needed  before 
we  can  actually  pin  down  the  actual  fundamental  cause  of 
LRD  in  cyber  security  domain. 

Related  Work 

The  present  study  is  based  on  honeypot-collected  cyber 
attack  data.  Previous  studies  of  such  data  are  mainly  from 
other  perspectives,  such  as  analyzing  the  honeypot-observed 
probing  activities  [2],  characterizing/grouping  attacks  [3], 
[4],  [5],  [6],  [7],  [8],  [9],  [10],  [11],  and  identifying  methods 
to  detect  attacks  such  as  DoS  (denial-of-service)  [12],  scans 
[13],  worms  [14],  [15],  and  botnets  [16],  [17].  In  contrast, 
we  study  the  statistical  properties  of  the  attack  processes 
corresponding  to  honeypot-emulated  services  (or  ports),  and 
the  feasibility  of  exploiting  the  identified  properties  for 
predicting  the  incoming  attacks. 

LRD  was  observed  in  benign  traffic  about  two  decades 
ago  [18],  [19],  [20],  [21],  To  our  knowledge,  LRD  is 
identified  here  to  be  relevant  in  the  cyber  security  domain. 

The  rest  of  the  paper  is  organized  as  follows.  Section  II 
briefly  review  some  statistical  preliminaries.  Section  III  in¬ 
troduces  the  data  we  analyze.  Section  IV  describes  the  iden- 


tification  of  LRD  that  is  exhibited  by  the  attack  processes 
corresponding  to  individual  victim  honeypot  IP  addresses. 
Sections  V  and  VI  explore,  at  the  scale  of  victims  (i.e.,  each 
attack  process  corresponds  to  a  victim),  the  exploitation  of 
the  LRD  for  predicting  incoming  attacks  and  the  cause  of 
LRD.  Section  VII  studies  the  same  data  but  from  a  more 
macroscopic  perspective,  by  treating  the  entire  honeypot  as 
a  victim  network.  Section  VIII  discusses  the  limitations  of 
the  present  study.  Section  IX  concludes  the  paper  with  future 
research  directions. 


II.  Statistical  Preliminaries 
Long-Range  Dependence  (LRD) 

The  LRD  phenomenon  has  received  much  attention  in 
contexts  other  than  cyber  security  [22],  [23],  [24],  [21].  A 
stationary  time  sequence  { X,,  i  >  1}  is  said  to  possess  LRD 
if  its  autocorrelation  function 


p(h)  =  Cor^Xi,  Xi+h)  ~  h  0L(h),  h oo,  (1) 


for  0  <  /3  <  1,  and  L(-)  is  a  slowly  varying  function  (cf. 
[25]).  That  is,  for  all  t  >  0, 


lim 

x — yoo 


L(tx) 

L(x) 


=  1. 


Note  that  PQ1)  —  oo,  which  captures  the  intuition  behind 
LRD  that  while  high-lag  autocorrelations  are  small  but  their 
cumulative  effect  is  important.  The  other  definition  of  LRD 
is  based  on  spectral  density  [26],  [27],  A  stationary  time 
series  is  said  to  have  LRD  if  its  spectral  density  at  frequency 
A  is  proportional  to  A_2ldl,  d  ^  0  as  A  — >  0.  These 
definitions  are  equivalent. 

The  statistical  feature  of  LRD  time  series  is  that  the  degree 
of  LRD  can  be  expressed  using  one  parameter,  which  is 
called  Hurst  parameter  (H).  The  parameter  3  in  Eq.  (1)  is 
related  to  the  Hurst  parameter  via  the  equation  /3  =  2  —  2 H. 
Therefore,  for  LRD,  we  have  1/2  <  H  <  1,  and  the  degree 
of  LRD  increases  as  H  — >  1.  The  parameter  in  spectral 
density  is  related  to  Hurst  parameter  via  the  equation  H  = 
.5  +  \d\.  In  the  Appendix,  we  briefly  review  six  popular 
methods  for  estimating  the  Hurst  parameter,  which  are  used 
in  our  analysis. 


Methods  for  removing  illegitimate  LRD 

It  is  known  that  LRD  can  also  be  explained  by  the  processes 
that  are  not  stationary  long-range  dependent  [28].  In  the 
literature,  the  processes  that  have  been  found  to  generate 
illegitimate  long  memory  which  include:  a  short-range  de¬ 
pendent  time  series  with  change  points  in  the  mean;  slowly 
varying  trends  with  random  noise;  a  stationary  parametric 
time  series  model  with  time-varying  parameters,  etc  [27], 
[29], 

Therefore,  we  need  to  remove  the  attack  processes  that 
exhibit  illegitimate  LRD.  The  following  method  is  for  this 
purpose,  which  tests  for  the  null  hypothesis  that  a  given  time 
series  is  a  stationary  LRD  process  against  the  alternative 
hypothesis  that  it  is  affected  by  change  points  or  a  smoothly 


varying  trend  [27] .  The  test  is  in  the  frequency  domain  and 
is  based  on  the  derivatives  of  the  profiled  local  Whittle 
likelihood  function  in  a  degenerating  neighborhood  of  the 
origin.  More  specifically,  we  are  testing  for: 

Hq'.  Xt  is  stationary  with  LRD. 

VS 

Ha-  Xt  —  Zt  +  [it  with  [it  =  [it- i  +  VWti 

where  Zt  is  a  stationary  short  memory  process,  r[t  is  a 
white  noise  process  and  y'y  is  a  Bernoulli  random  variable 
which  takes  value  1  with  probability  pn.  We  also  test  for 
the  alternative 

Ha-  Xt  =  Zt  +  h(t/n ), 

where  h(-)  is  a  Lipschitz  continuous  function  on  [0, 1]. 

Two  statistical  models 

Throughout  this  paper,  we  call  a  model  LRD-less  if  it  cannot 
accommodate  LRD,  and  call  a  model  LRD-aware  if  it  can 
accommodate  LRD.  The  purpose  is  to  show  that  prediction 
of  cyber  attacks  should  be  based  on  LRD-aware  models  for 
better  accuracy  when  the  cyber  attack  data  exhibited  the 
LRD  phenomenon.  Specifically,  we  consider  two  popular 
models  of  these  kinds.  Let  ey  be  independent  and  identical 
normal  random  variables  with  mean  0  and  variance  <r2. 

•  The  LRD-less  ARMA  model:  Autoregressive  moving 
average  process  of  orders  p  and  q  model,  abbreviated 
to  ARMA(p.q),  is  one  of  the  most  popular  models  in 
time  series  [26],  with 

p  q 

Xt  =  <j>iYt-i  +  et.  +  djet-j- 
»= i  f=i 

•  The  LRD-aware  FARIMA  model:  A  well-known 
LRD-aware  model  [22],  [23],  [24]  is  the  Fractional 
ARIMA(p,  d,  q)  (FARIMA)  with  0  <  d  <  1/2  and 
H  =  d  +  1/2.  Specifically,  define  polynomials 

p  q 

cj)(x)  =  1  —  ^  and  ip{x)  =  1  +  ^  ipjX'’  ■ 

7=1  7=1 

Let  Xt  be  a  stationary  process  such  that 

3(B)(l-B)dXt  =^(B)et, 

for  some  —1/2  <  d  <  1/2,  where  B  is  the  back  shift 
operator  defined  by  BXt  =  Xt-i,  B2Xt  =  Xt-2,  and 
so  on.  Then  Xt  is  called  FARIMAfp,  d,  q). 

Measurement  of  prediction  accuracy 

Suppose  Xm,  Xm+i, . . . ,  Xz  are  observed  data  (all  non¬ 
negative),  and  Ym,Ym+i,  ■  ■  ■  Yz  are  the  predicted  data. 
We  can  define  the  prediction  error  ey  =  Xt  —  Yt,  where 
m  <  t  <  z.  Recall  the  popular  statistic  PMAD  (Percent 
Mean  Absolute  Deviation): 

PMAD  = 

E  ;=mXt 
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While  PMAD  can  be  seen  as  the  overall  prediction  error,  we 
can  define  a  variant  we  call  underestimation  error,  which 
counts  only  the  underestimations  as  follows: 


PMAD'  = 


E  Um  e« 
E  Lm*t 


for  et  >  0  and  corresponding  Xt- 


This  is  relevant  because  if  the  defender  is  willing  to  over¬ 
provision  some  defense  resources,  the  predicted  results  can 
be  more  useful  because  it  is,  as  we  will  see,  relatively  easier 
to  contain  underestimation  error  (which  can  correspond 
to  attacks  that  can  be  overlooked  due  to  the  insufficient 
provisioning  of  resources). 

We  also  found  it  convenient,  when  dealing  with  the 
prediction  accuracy  at  the  level/scale  of  victims,  to  define 
prediction  overall  accuracy,  or  OA  for  short,  based  on  the 
PMAD  statistic  as  follows: 


OA  =  1  -  PMAD  =  1  - 


EUn  N 

Et=m  Xt  ' 


Moreover,  in  the  context  of  the  present  paper,  overestimation 
(i.e.,  Yj  >  Xj  for  some  j’s)  meaning  the  over-provisioning 
of  resources  in  expecting  incoming  attacks,  but  underesti¬ 
mation  (i.e.,  Yj  <  Xj  for  any  j )  can  cause  the  overlook  of 
attacks  (i.e.,  a  sort  of  false-negative  because  the  defender 
does  not  allocate  enough  resource  to  examine  the  incoming 
traffic).  Therefore,  it  makes  a  good  sense  to  define  the 
following  underestimation  accuracy,  or  UA  for  short: 


UA  =  1  - 


Et=m  e« 
E  t=mXt 


for  et  >  0  and  corresponding  Xt- 


That  is,  UA  disregards  the  over-estimations. 


III.  Data  Description 

The  attack  traffic  data  was  collected  by  a  honeypot, 
which  ran  four  popular  low-interaction  honeypot  software 
programs:  Dionaea  [30],  Mwcollector  [31],  Amun  [32],  and 
Nepenthes  [33].  Each  honeypot  IP  address  was  assigned 
to  one  of  these  programs  and  was  completely  isolated 
from  the  other  honeypot  IP  addresses.  To  save  resources, 
a  single  honeypot  computer  was  assigned  with  multiple  IP 
addresses  and  thus  ran  multiple  honeypot  software  programs. 
A  dedicated  computer  was  used  to  collect  the  raw  network 
traffic  as  p cap  files,  which  are  timestamped  at  the  resolution 
of  microsecond.  The  vulnerable  services  offered  by  all  four 
honeypot  programs  are  SMB,  NetBIOS,  HTTP,  MySQL  and 
SSH,  each  of  which  is  associated  to  a  unique  TCP  port.  This 
means  that  each  IP  address  (i.e.,  honeypot  software)  opens 
the  ports  corresponding  to  these  services.  We  call  these  ports 
production  ports,  and  the  other  ports  non-production  ports 
(because  they  are  associated  to  no  services).  The  concrete 
attacks  targeting  the  production  ports  can  be  dependent  upon 
the  specific  vulnerabilities  emulated  by  the  honeypot  pro¬ 
grams  (e.g.,  the  Microsoft  Windows  Server  Service  Buffer 
Overflow  MS06040  and  Workstation  Service  Vulnerability 
MS06070  for  the  SMB  service).  However,  low-interactive 
honeypots  do  not  capture  sufficient  information  for  precisely 
characterizing  the  specific  attacks. 


Period 

Time 

Duration  (days) 

#  of  victim  IPs 

I 

11/04/2010  -  12/21/2010 

47 

166 

II 

02/09/2011  -  02/27/2011 

18 

166 

III 

03/12/2011  -  05/06/2011 

54 

166 

IV 

05/09/2011  -  05/30/2011 

21 

166 

V 

06/22/2011  -  09/12/2011 

80 

166 

Table  I:  Data  description 


Table  I  summarizes  the  datasets,  which  correspond  to  166 
honeypot  IP  addresses  and  five  periods  of  time.  These  peri¬ 
ods  are  not  strictly  consecutive  because  of  network/system 
maintenance  etc. 

A  preprocess  procedure  was  used  to  translate  the  raw 
pcap  data  into  attack-oriented  data  format.  First,  we  dis¬ 
tinguish  the  attack  traffic  based  on  whether  the  attacks  are 
against  the  production  ports  or  against  the  non-production 
ports.  We  focus  on  the  attack  traffic  against  the  production 
ports,  which  lead  to  some  interactions  with  the  attackers. 
These  interactions  led  to  the  attack  data  that  is  suitable  for 
the  purpose  of  our  analysis. 

Second,  we  treat  the  incoming  TCP  flows  as  incoming 
attacks  because  the  honeypot-emulated  vulnerable  services 
are  based  on  TCP.  A  TCP  flow  can  be  extracted  from  the 
pcap  data  because  it  is  uniquely  identified  by  the  attacker’s 
IP  address,  the  port  used  by  the  attacker,  the  victim  IP 
address  in  the  honeypot,  and  the  port  that  is  under  attack. 
An  unfinished  TCP  handshake  is  also  treated  as  a  flow,  and 
thus  an  attack.  This  is  reasonable  because  an  unsuccessful 
handshake  can  be  caused  by  events  such  as  that  the  honeypot 
port  in  question  is  busy  (i.e.,  the  connection  is  dropped). 
For  flows  that  do  not  end  with  the  FIN  flag  (indicating  safe 
termination  of  a  TCP  connection)  or  the  RST  flag  (indicating 
unnatural  termination  of  a  TCP  connection),  we  used  the 
following  two  parameters  in  the  preprocess. 

1)  Flow  timeout  time:  A  flow  is  considered  expired  when 
no  packet  of  the  flow  is  received  during  a  time  window 
of  60  seconds,  which  is  reasonable  because  honeypot 
system  provides  limited  interactions. 

2)  Flow  lifetime:  A  flow  is  considered  expired  when  a  flow 
lives  longer  than  a  pre-determined  lifetime,  which  is  set 
to  be  300  seconds  also  due  to  the  fact  that  honeypot 
systems  provide  limited  interactions  [34], 

IV.  The  Long-Range  Dependence  Phenomena 
A.  Methodology 

We  found  it  convenient  to  analyze  honeypot-captured 
cyber  attack  data  from  a  victim  perspective,  where  a  vic¬ 
tim  corresponds  to  a  single  honeypot  IP  address.  This  is 
reasonable  because  in  production  networks,  an  IP  address 
often  (albeit  not  always)  represents  a  computer  and  reflects 
a  surface  that  is  exposed  to  the  adversary.  We  also  found  it 
convenient  to  treat  the  attacks  against  a  victim  as  a  stochastic 
process,  dubbed  stochastic  cyber  attack  process,  which  is  in 
principle  a  kind  of  Point  Process  [35].  Figure  1(a)  illustrates 
the  attacks  against  individual  victim  IP  addresses,  where 
the  dots  on  the  time  line  corresponding  to  a  specific  victim 
formulates  an  attack  process.  Figure  1(b)  further  elaborates 
the  idea,  where  a  victim  is  attacked  by  N  attackers  (or 
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attacking  computers)  and  the  attacks  arrive  at  time  ti, ...  ,tg. 
Attacks  may  target  production  ports  that  are  associated 
to  some  services,  or  non-production  ports  that  are  not 
associated  to  any  service.  Since  attacks  targeting  the  non¬ 
production  ports  can  be  simply  dropped  by  the  computer, 
we  focus  on  the  attacks  against  the  production  ports.  This  is 
reasonable  because  real-life  attacks  often  target  production 
ports  (one  exception  is  the  denial-of-service  attacks,  which 
are  out  of  scope  of  the  present  paper). 

— - . - . - . - . - . . - . - ■ — ►  VictimlP  I 

- . - . - . - . - . - . - . - . - . - .Victim  IP  2 

- . - . - . - . - . - . - . - . - . - .Victim  I P  1 

h  *2  ^3  1 4  t5  t6  t7  tB  t9  Jjme 

(a)  Illustration  of  stochastic  cyber  attack  processes  against  individual  victim 
IP  addresses.  For  a  specific  victim,  the  dots  represent  the  attack  events 
against  it.  The  attacks  against  Victim  IP  1  arrive  at  times  ti,.. .  ,tg. 

|  Attacker  computer  1  |[~  Attacker  computer  2  |  |  Attacker  computer  N  | 

|  VictimlPaddress(orcompiiter)underattack  | 

ti  t2  t3  t4  t5  t6  t7  t8  t9  Time 

(b)  Elaboration  of  the  stochastic  cyber  attack  process  against  a  specific 
victim,  where  the  attacks  arrive  at  times  ti,...,tg.  This  paper  focuses  on 
the  attacks  against  the  production  ports 

Figure  1:  Illustration  of  stochastic  cyber  attack  processes 

We  study  the  statistical  properties  of  the  attack  processes 
from  two  important  perspectives:  attack  inter-arrival  time 
and  attack  rate.  In  either  case,  we  study  both  basic  statistics 
(which  can  hint  what  kinds  of  advanced  statistical  analyses 
are  appropriate  for  further  analysis)  and  the  LRD  phe¬ 
nomenon  they  exhibit. 

B.  Basic  Statistics  of  Attack  Rate  (Per  Hour) 

Attack  rate  describes  the  number  of  attacks  that  arrive  at 
some  resolution  unit  of  time  (e.g.,  minute  or  hour  or  day). 
We  here  consider  the  hourly  attack  rate,  while  noting  that 
daily  attack  rate  is  not  appropriate  for  the  data  because  the 
individual  datasets  correspond  to  no  more  than  80  days. 

Table  II  describes,  for  each  period,  the  observed  lower- 
bound  and  upper-bound  for  each  statistic  among  the  166 
victims.  By  taking  Period  I  as  an  example,  we  observe  the 
following:  there  can  be  no  attacks  against  a  specific  victim 
during  one  hour  (as  mentioned  above,  some  victim  actually 
was  not  attacked  for  up  to  7  hours);  the  average  per-hour 
attack  rate  (among  all  the  victims)  is  some  number  between 
32  and  1810  attacks  per  hour;  the  median  per-hour  attack 
rate  is  some  number  between  8  and  1327  attacks  per  hour; 
the  maximum  number  of  attacks  against  a  single  victim  can 
be  up  to  14403. 

Figure  2  presents  the  boxplots  of  four  statistics:  mean, 
median,  maxim,  and  variance.  It  clearly  shows  that  the  five 
periods  exhibit  somewhat  similar  (homogeneous)  statistical 
properties.  For  example,  there  are  many  outliers  in  each 
period  and  for  each  statistic. 
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Figure  2:  Boxplot  of  mean  attack  rate  (per-hour).  The  colored  box 
represents  Q i,  which  represents  the  attack  rates  from  the  smallest 
25%,  to  Q 3,  which  represents  the  largest  25%  attack  rates.  The 
thick  vertical  line  within  the  colored  box  indicates  the  median 
attack  rate  among  all  the  attack  rates.  An  attack  rate  smaller  than 
Qi  —  1.5(Q3  —  Q i)  or  larger  than  Q3  +  1.5(Q3  —  Q i)  is  called 
an  outlier.  The  circles  represent  outlier  attack  rates. 

Both  Table  II  and  Figure  2  hint  that  the  variances  are 
often  much  larger  than  the  means  in  all  five  periods.  By 
looking  into  all  individual  attack  processes,  we  found  that 
among  all  the  166  victims/period  x  5  periods  =  830  victims, 
the  variance  of  attack  rate  is  at  least  3.5  times  greater  than 
the  mean  attack  rate  corresponding  to  the  same  victim.  The 
fact  —  the  variance  is  much  larger  than  the  mean  attack 
rate  —  hints  that  Poisson  models  may  not  be  appropriate 
for  describing  the  attack  processes.  This  is  confirmed  below 
via  formal  statistical  tests  that  are  presented  below. 

C.  The  Cyber  Attack  Processes  Are  Not  Poisson 

It  would  be  ideal  that  the  stochastic  cyber  attack  processes 
are  Poisson  because  we  can  easily  characterize  Poisson 
processes  with  very  few  parameters  and  there  are  many 
mature  methods  for  analyzing  them.  For  example,  we  can 
use  the  property  that  the  superposition  of  Poisson  processes 
is  still  a  Poisson  process  [25]  to  simplify  the  problems  of 
interest  (e.g.,  when  we  consider  the  attacks  against  an  entire 
network). 

Assume  that  the  attack  inter-arrival  times  are  independent 
and  identically  distributed  exponential  random  variables  with 
distribution 

F(x )  =  1  —  e~Xx ,  A  >  0,  x  >  0. 

To  test  the  exponential  distribution,  we  first  estimate  the 
unknown  parameter  A  by  the  maximum  likelihood  method. 
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Period 

(days) 

MIN(-) 

Mean(-) 

Median(-) 

Variance(-) 

MAX(-) 

LB 

UB 

LB 

UB 

LB 

UB 

LB 

UB 

LB 

UB 

I  (47) 

0 

197 

32.053 

1810.417 

8 

1327 

1589.907 

3219758.756 

247 

14403 

II  (18) 

0 

139 

49.843 

1411.986 

43 

1112 

1466.538 

1553585.624 

335 

10995 

III  (54) 

0 

94 

11.514 

1513.460 

3 

1490 

254.017 

676860.697 

125 

5287 

IV  (21) 

0 

3 

3.490 

1663.375 

1 

1184 

29.709 

2808045.223 

41 

7793 

V  (80) 

0 

35 

33.974 

2228.840 

8.5 

1526.5 

1225.635 

4639659.080 

274 

12267 

Table  II:  Basic  statistics  of  attack  rate  (number  of  attacks  per  hour).  For  a  specific  period  and  a  specific  victim  (IP  address).  Ml  ^(victim) 
and  MAX(victim)  are  respectively  the  smallest  and  largest  per-hour  attack  rates,  Mean(t lictim),  W\ed\an(victim)  and  Variance(mcfim) 
are  respectively  the  mean  per-hour  attack  rate,  the  median  per-hour  attack  rate,  and  the  variance  of  attack  rate.  For  a  specific  period 
and  a  specific  statistic  X  £  {MIN,  Mean,  Median,  Variance,  MAX},  V  victim  and  V  hour  we  have  X (victim,  hour)  >  LB  and 
X (victim,  hour)  <  UB. 


Then,  we  compute  the  KS,  CM  and  AD  test  statistics  [36], 
[37]  (cf.  Appendix  for  a  review)  and  compare  them  against 
the  respective  critical  values. 


Period 

(days) 

KS 

CM 

AD 

min 

max 

min 

max 

min 

max 

I  (47) 

0.127 

0.535 

482.295 

59543.874 

inf 

inf 

II  (18) 

0.061 

0.497 

47.082 

20437.821 

298.728 

inf 

III  (54) 

0.060 

0.651 

163.711 

51434.324 

1103.704 

inf 

IV  (21) 

0.037 

0.814 

3.438 

31376.271 

22.832 

inf 

V  (80) 

0.083 

0.652 

323.387 

214543.536 

inf 

inf 

cv 

0.01215705 

0.222 

1.132 

Table  III:  Minimum  values  of  the  three  test  statistics  for  attack 
inter-arrival  time  (unit:  second).  (Inf)  means  the  value  is  extremely 
large. 


Table  III  reports  the  minimum  test  statistics,  where  the 
critical  values  for  the  test  statistics  are  based  on  significant 
level  .05  and  obtained  from  [38],  [39],  Since  the  values  are 
far  from  the  critical  values,  there  is  no  evidence  to  support 
the  exponential  distribution  hypothesis.  Technically,  this  is 
because  the  minimum  test  statistics  violate  the  exponential 
distribution  assumption,  meaning  that  a  greater  test  statistics 
must  violate  the  exponential  distribution  assumption  as  well. 


plot  is  displayed  in  Figure  3.  We  observe  a  large  deviation 
in  the  tails.  Hence,  exponential  distribution  cannot  be  used 
as  the  distribution  of  attack  inter-arrival  times,  meaning  that 
the  attack  processes  are  not  Poisson. 

D.  LRD  Exhibited  by  the  Attack  Processes 

Similarly,  Table  IV  describes  the  minimums  and  max- 
imums  of  the  estimated  Hurst  parameters  of  attack  rates. 
Consider  Period  I  as  an  example,  we  observe  that  the  attack 
processes  corresponding  to  163  (out  of  166)  victims  have 
average  Hurst  parameters  falling  into  [.6, 1]  and  thus  exhibit 
LRD,  where  the  average  is  taken  over  the  six  Hurst  pa¬ 
rameters.  However,  only  the  attack  processes  corresponding 
to  159  (out  of  the  163)  victims  exhibit  legitimate  LRD, 
meaning  that  the  LRD  corresponding  to  4  (out  of  the  163) 
victims  are  actually  caused  by  the  non-stationarity  of  the 
processes  [28].  We  also  observe  that  in  Period  III,  there 
are  only  87  victims  whose  corresponding  attack  processes 
exhibit  legitimate  LRD.  Overall,  70%  of  the  stochastic 
attack  processes,  or  159  +  116  +  87  +  125  +  89  =  576 
out  of  166  x  5  =  830  attack  processes,  exhibit  the  LRD 
phenomenon  from  the  perspective  of  attack  rate. 


Figure  3:  QQ-plot  of  inter-arrival  time  corresponding  to  the  specific 
victim  that  exhibits  the  minimum  KS,  CM  and  AD  value  simulta¬ 
neously  (time  unit:  second) 


Mean  H  value 

Figure  4:  Boxplot  of  Hurst  parameters  of  attack  rate  corresponding 
to  individual  victims  (per  hour). 


We  also  use  QQ-plot  to  evaluate  the  goodness-of-fit  of 
exponential  distributions  for  the  attack  inter-arrival  time 
corresponding  to  the  victim  that  exhibits  the  minimum  test 
statistics  simultaneously  in  Table  III.  This  is  the  victim 
from  Period  IV  with  HKS  =  0.037,  Hc m  =  3.438  and 
H/\d  =  22.832.  If  the  attack  inter-arrival  time  corresponding 
to  this  particular  victim  does  not  exhibit  the  exponential 
distribution,  we  conclude  that  no  attack  inter-arrival  time 
in  this  dataset  exhibits  the  exponential  distribution.  The  QQ 


Figure  4  shows  the  boxplots  of  Hurst  parameters  of  attack 
rate.  We  observe  that  Periods  I  and  II  have  relatively  large 
Hurst  parameters,  suggesting  stronger  LRD.  As  we  will  see 
in  Section  V,  Period  II  is  somehow  more  resistant  against 
prediction  than  Period  I.  It  is  also  interesting  to  note  that, 
as  we  will  see  in  Section  VII  that  Periods  I  and  II  can  be 
pretty  accurately  predicted  even  for  five-hour  ahead  of  time. 
It  remains  to  be  investigated  whether  this  is  coincident  or 
causal. 
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Period 

(days) 

RS 

AGV 

Peng 

Per 

Box 

Wave 

#  of  victims  w/ 

h  e  [.6,  i] 

#  of  victims  w/ 
LRD 

min 

max 

min 

max 

min 

max 

min 

max 

min 

max 

min 

max 

1(47) 

0.533 

1.011 

0.460 

0.977 

0.664 

1.135 

0.725 

1.387 

0.545 

1.153 

0.395 

0.957 

163 

159 

II  (18) 

0.494 

0.944 

0.399 

0.979 

0.562 

1.371 

0.529 

1.687 

0.334 

1.321 

-0.551 

1.330 

130 

116 

III  (54) 

0.646 

0.952 

0.299 

0.956 

0.528 

1.059 

0.439 

1.217 

0.430 

0.979 

0.334 

1.015 

93 

87 

IV  (21) 

0.401 

1.133 

0.118 

1.004 

0.494 

1.448 

0.325 

1.738 

0.421 

1.321 

-0.337 

1.466 

126 

125 

V  (80) 

0.519 

1.006 

0.136 

0.985 

0.445 

1.219 

0.474 

1.433 

0.565 

1.304 

-0.162 

1.175 

158 

89 

Table  IV:  The  estimated  Hurst  parameters  for  attack  rate  (per  hour). 


E.  Summary 

We  observe  70%  of  the  attack  processes  exhibit  legitimate 
LRD  from  the  perspective  of  attack  rate.  Both  the  attack 
inter-arrival  time  and  the  attack  rate  with  respect  to  victim 
computers  or  IP  addresses  exhibit  the  LRD  phenomenon. 
This  means,  as  in  the  case  of  benign  traffic  [18],  [19], 
[20],  that  the  superposition  of  attack  processes  will  sustain 
the  burst  of  attacks,  rather  than  smoothing  the  burst  of 
attacks.  This  has  three  implications.  First,  defenders  should 
expect  that  the  burst  of  attacks  will  sustain.  Second,  cyber 
attack  processes  may  only  be  modeled  using  LRD-aware 
stochastic  processes.  Third,  we  need  to  explore  how  can  we 
exploit  the  presence  of  LRD  to  conduct  better  prediction  for 
the  incoming  attacks.  This  is  important  because  reasonably 
accurate  prediction  can  give  the  defenders  sufficient  early 
warning  time  to  prepare  for  the  arrival  of  attacks.  In  Section 
V  we  explore  the  feasibility  of  prediction  and  in  Section  VI 
we  explore  the  causes  of  LRD  in  this  context. 

V.  Exploiting  LRD  for  Predicting  Attack  Rate 

We  ask  the  following  natural  question:  How  many  attacks 
will  arrive  in  the  next  hours  of  time?  Intuitively,  the  model 
that  is  good  at  prediction  in  this  context  should  also  ac¬ 
commodate  the  LRD  phenomena,  meaning  that  LRD-aware 
model  might  lead  to  better  prediction.  Being  able  to  predict 
the  incoming  attacks  at  a  reasonable  accuracy  can  lead  to 
insights,  which  can  guide  the  defense  operations  of  real-life 
defenders. 

A.  Prediction  Algorithm 

Let  {Xi, . . . .  Xn  }  be  the  time  series  of  observed  attack 
rates.  The  basic  idea  behind  the  prediction  algorithm  is  to 
use  some  portion  of  the  observed  data  to  build  a  model 
(training  or  model  fitting),  which  is  then  used  to  predict  the 
attack  rates  corresponding  to  the  rest/future  portion  of  the 
observed  data  (prediction).  In  order  to  have  a  reliable  model, 
50%  of  the  observed  data  is  used  as  the  training  data  for 
building  models.  Let  h  the  input  parameter,  which  indicates 
the  step  we  will  predict  ahead  (i.e.,  the  next  time  interval 
during  which  we  predict  the  number  of  attacks  that  will 
arrive).  The  algorithm  operates  as  follows  (we  refer  to  [40] 
for  more  details  about  time-series  prediction  methodologies): 

1)  We  divide  {Xi, . . . ,  Xn}  into  two  parts,  {Xi, . . . ,  Xm} 
and  {Xm+i, . . .  ,Xn},  where  m  =  [n/2], 

2)  Repeat  the  following  steps  until  to  >  n  —  h. 

a)  Fit  {Xi, . . . ,  Xm}  to  obtain  a  model  denoted  by  Mm. 

b)  Use  Mm  to  predict  the  number  of  attacks,  denoted 
by  Ym+h,  that  will  arrive  during  the  (to  +  h)th  step. 

c)  Compute  prediction  error  em+h  =  Xrn+h  -  Ym+h. 


d)  Set  to  =  to  +  10,  i.e.,  adding  10  more  observations 
to  the  model  for  refitting  purpose  in  (a). 

B.  Prediction  of  Attack  Rate 

Since  there  are  166  processes/period  x  5  periods  =  830 
processes  prediction  results,  we  have  to  use  Table  V  to  suc¬ 
cinctly  present  the  prediction  results.  The  prediction  results 
are  with  respect  to  10-hour  ahead  predictions  during  the  last 
100  hours  of  each  time  period.  We  make  the  following  obser¬ 
vations.  First,  the  LRD-aware  FARIMA  performs  reasonably 
better  than  the  LRD-less  ARMA.  For  example,  among  the 
152  LRD  processes  in  Period  I,  FARIMA  can  predict  for 
29  victims  about  their  10-hour  ahead  attack  rates  with  at 
least  70%  overall  accuracy,  while  ARMA  can  only  predict 
for  13  victims.  If  the  defender  is  willing  to  over-provision 
some  resources  and  mainly  cares  about  the  underestimation 
error  (which  could  cause  passing  incoming  packets  without 
inspecting  it  while  not  disrupting  the  network  functions), 
FARIMA  can  predict  for  40  victims. 

Second,  the  LRD-processes  in  Period  I  render  more  to 
prediction  when  compared  with  the  LRD-processes  in  the 
other  periods. 

Third,  for  non-LRD  processes,  neither  FARIMA  nor 
ARMA  can  provide  good  predictions.  This  may  be  caused 
by  that  (some  of)  the  non-LRD  processes  are  non-stationary. 
We  plan  to  investigate  into  this  issue  in  the  future. 

C.  Summary 

We  found  that  LRD-aware  models  can  predict  attack  rates 
of  LRD-processes,  and  that  LRD-aware  models  are  relatively 
more  accurate  than  LRD-less  models  in  prediction  with 
respect  to  LRD-processes.  Moreover,  we  also  found  that 
there  are  LRD  processes  that  can  resist  the  prediction  of 
even  LRD-aware  models.  This  hints  that  in  order  to  be  able 
to  predict  incoming  attacks  hours  ahead  of  time,  we  need 
new  models. 

VI.  Exploring  Causes  of  the  LRD  Phenomenon 

We  have  observed  that  the  attack  processes  corresponding 
to  70%  (576  out  of  the  166  victims/period  x  5  periods 
=  830)  victims  exhibit  the  legitimate  LRD  phenomenon 
(i.e.,  not  caused  by  the  non-stationarity  of  the  processes  as 
discussed  in  Section  IV),  and  showed  how  to  exploit  LRD 
for  better  prediction.  Now  we  explore  the  possible  causes  of 
the  legitimate  LRD  exhibited  by  the  data.  Despite  intensive 
studies  in  other  settings,  the  fundamental  cause  of  LRD  was 
mysterious.  In  the  setting  of  the  present  paper,  we  focus  on 
exploring  two  possible  causes. 

•  In  the  context  of  benign  traffic,  it  was  also  known  that 
LRD  is  probably  caused  by  the  heavy-tail  phenomenon 
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Period 

total  #  of  victims 

((*1,  *2  )/(y)) 

#  of  victims 
w/  average 

OA  >  80% 

#  of  victims 
w/  average 

OA  >  70% 

#  of  victims 
w/  average 

OA  >  60% 

#  of  victims 
w/  average 

UA  >  80% 

#  of  victims 
w/  average 

UA  >  70% 

#  of  victims 
w/  average 

UA  >  60% 

FARIMA 

ARMA 

FARIMA 

ARMA 

FARIMA 

ARMA 

FARIMA 

ARMA 

FARIMA 

ARMA 

FARIMA 

ARMA 

I 

LRD:  (152,152)/(159) 

2 

1 

29 

13 

81 

66 

13 

4 

40 

35 

89 

68 

non-LRD:  (7,7)/(7) 

0 

0 

4 

4 

6 

6 

1 

4 

7 

6 

7 

7 

II 

LRD:  ( 1 09, 1 09)/(  116) 

0 

0 

3 

2 

9 

8 

2 

1 

12 

6 

26 

15 

non-LRD:  (50,49)/(50) 

0 

0 

0 

0 

0 

2 

4 

1 

6 

2 

13 

5 

III 

LRD:  (82,82)/(87) 

0 

0 

4 

4 

8 

9 

9 

5 

23 

19 

50 

43 

non-LRD:  (79,79)/(79) 

0 

0 

0 

0 

0 

0 

0 

0 

10 

7 

31 

24 

IV 

LRD:  (1 18,1 18)/(125) 

0 

0 

2 

2 

5 

6 

2 

3 

4 

6 

11 

14 

non-LRD:  (41,39)/(41) 

0 

0 

0 

0 

0 

0 

1 

0 

2 

0 

4 

1 

V 

LRD:  (73.73)/(89) 

0 

0 

0 

0 

2 

1 

0 

1 

2 

3 

16 

4 

non-LRD:  (77,61)/(77) 

0 

0 

0 

0 

1 

1 

0 

0 

1 

0 

24 

15 

Table  V:  Number  of  attack  processes  that  can  be  pretty  actually  predicted  by  the  LRD-aware  FARIMA  model,  which  is  more  accurate 
than  the  prediction  of  the  LRD-less  ARMA  model.  For  the  column  “total  #  of  victims  ((xi,  X2)/(y)),”  y  is  the  total  number  of  victims 
that  exhibited  LRD  (or  non-LRD),  x\  (or  *2)  is  total  number  (out  of  the  y)  of  victims  for  which  the  Maximum  Likelihood  Estimator 
used  in  the  FARIMA  (ARMA)  algorithm  actually  converges  (i.e.,  y  —  xi  and  y  —  x 2  victims  cannot  be  predicted,  where  the  limitation  is 
specific  to  MLE  rather  than  to  our  algorithm).  The  column  “#  of  victims  w/  average  OA  (or  UA)  >  z%”  represents  the  average  number 
of  victims  (out  of  the  xi  or  X2  victims  that  can  be  predicted)  that  lead  to  average  prediction  accuracy  in  terms  of  overall-accuracy  (or 
underestimation-accuracy)  is  at  least  z%,  where  average  is  taken  over  all  prediction  steps. 


that  is  exhibited  by  the  underlying  sub-processes  [1],  In 
Section  VI-A,  we  investigate  whether  or  not  the  LRD 
phenomeon  we  observed  was  caused  by  the  heavy-tail 
phemomenon  that  is  exhibited  by  the  underlying  cyber 
attack  sub-processes. 

•  It  is  intuitive  that  the  LRD  may  be  caused  by  that  many 
attackers  launch  intensive  (and  consecutive)  attacks 
against  individual  victims  (i.e.,  the  attacks  launched  by 
individual  attackers  are  bursty).  In  Section  VI-B,  we 
investigate  whether  or  not  this  is  a  probable  cause  of 
the  LRD. 

A.  Statistical  Properties  of  Port-centric  Processes 

In  order  to  investigate  whether  or  not  the  LRD  is  caused 
by  that  the  underlying  sub-processes  exhibit  the  heavy- 
tail  phenomenon,  we  need  to  decompose  an  attack  process 
corresponding  to  a  victim  into  port-centric  processes  corre¬ 
sponding  to  the  individual  ports  of  the  victim.  As  shown 
in  Figure  5,  the  attack  process  corresponding  to  the  victim, 
which  describes  the  attacks  that  arrive  at  time  ti, . . . ,  fg,  is 
the  superposition  of  M  port-centric  sub-processes,  where 
attacks  against  an  individual  port  are  treated  as  an  sub¬ 
process. 


Victim  port  1 
Victim  port  2 

Victim  port  3 


Port-centric 
cyber  attack 
sub-processes 


Victim  port  M- 
Victim  - 


ti  t2  t3  t4  t5  t6  t7  t8  tgTime 


Cyber  attack 
process  w.r.t.  a 
victim 


Figure  5:  Decomposition  of  the  cyber  attack  process  corresponding 
to  a  victim  into  port-centric  sub-processes:  the  sub-process  corre¬ 
sponding  to  victim  Port  1  describes  the  attacks  that  arrive  at  time  (2 
and  t$  (the  attacks  may  be  launched  by  one  or  two  attackers);  the 
sub-process  corresponding  to  victim  Port  2  describes  the  attacks 
that  arrive  at  fi,  te  and  fg,  etc. 


We  want  to  see  whether  or  not  the  port-centric  sub¬ 
processes  exhibit  the  heavy-tail  phenomenon.  Table  VI 


shows  that  only  8%  sub-processes  (or  56+80+47+3+32  = 
218  out  of  the  (159  +  116  +  87  +  125  +  89  =  576) 
victims  x  5  ports/victim  =  2880  ports)  exhibit  the  heavy- 
tail  phenomenon.  Moreover,  only  29  (out  of  the  576)  victims 
have  two  or  three  sub-processes  that  exhibit  the  heavy-tail 
phenomenon.  Further,  there  is  only  1  port  in  total  whose  cor¬ 
responding  sub-processes  exhibits  infinite  mean  because  the 
shape  value  >  1,  and  there  are  1  +  50  +  2  +  1  +  5  =  59  ports 
whose  corresponding  sub-processes  exhibit  infinite  variance 
because  their  shape  values  €  (.5, 1).  The  above  observations 
hint  that  unlike  in  the  setting  of  benign  traffic  [1],  the 
LRD  phenomenon  exhibited  by  the  honeypot-captured  attack 
traffic  is  probably  not  caused  by  the  heavy-tail  phenomenon 
exhibited  by  the  underlying  attack  sub-processes. 

B.  Statistical  Properties  of  Attacker-centric  Processes 

In  order  to  figure  out  whether  or  not  the  LRD  phenomenon 
was  caused  by  the  intense  (consecutive)  attacks  that  are 
launched  by  individual  attackers,  we  consider  the  attacks 
against  each  victim  that  are  launched  by  distinct  attackers. 
As  illustrated  in  Figure  6,  even  though  an  attacker  launched 
multiple  consecutive  attacks  against  a  victim,  we  only  con¬ 
sider  the  first  attack.  We  call  the  resulting  processes  attack¬ 
centric.  The  purpose  of  studying  attacker-centric  processes 
is  the  following:  If  the  attacker-centric  processes  do  not 
exhibit  the  LRD  phenomenon,  we  conclude  that  the  LRD 
phenomenon  we  observe  is  probably  caused  by  the  intensity 
of  the  attacks  that  are  launched  by  individual  attackers; 
otherwise,  the  LRD  is  probably  not  caused  by  the  intensity 
of  the  attacks. 

Table  VII  describes  the  observed  lower-bound  and  upper- 
bound  of  the  four  statistics  regarding  the  attacker-centric 
processes,  where  the  bounds  are  among  all  victims  within 
a  period  of  time.  By  taking  Period  II  as  an  example,  we 
observe  the  following:  on  average  there  are  between  48  and 
100  attackers  against  one  individual  victim  within  one  hour, 
and  there  can  be  up  to  621  attackers  against  one  individual 
victim  within  one  hour.  Further,  attacks  in  Periods  III  and 
IV  exhibit  different  behaviors  from  the  other  three  periods. 
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Period 

total  #  of 
victims 
exhibiting 
LRD 

#  of  victims  w/ 
sub-processes 
exhibiting 
heavy-tail 

#  of  victims  with 
certain  #  of  sub-processes 
exhibiting  heavy-tail 

total  # 
of  ports 
exhibiting 
heavy-tail 

Shape 

mean 

value 

#  of  ports 
w/  shape 
value 

e  (.5,1) 

#  of  ports 
w/  shape 
value 
>  1 

Standard 

deviation 

1 

2 

3 

4 

5 

I 

159 

56 

50 

6 

0 

0 

0 

62 

.1099567 

1 

0 

.1072166 

II 

116 

80 

78 

11 

1 

0 

0 

103 

.4001687 

50 

0 

.224729 

III 

87 

47 

39 

6 

2 

0 

0 

57 

.2180889 

2 

0 

.1761706 

IV 

125 

3 

3 

0 

0 

0 

0 

3 

.4303879 

1 

0 

.3454725 

V 

89 

32 

29 

1 

2 

0 

0 

37 

.2996517 

5 

1 

.2483402 

Table  VI:  Out  of  the  attack  processes  exhibit  the  LRD  phenomenon,  some  port-centric  sub-processes  exhibit  the  heavy-tail  phenomenon. 


Attacker  1 
Attacker  2 
Attacker  3 


*8 


t 


9 


Attacker  N 
Victim 


t7 


tl  *2  ^3  t4 


A  ttacker-centric 

attack  process 
Time  .  .  .. 

w.r.t.  a  victim 


Figure  6:  Attacker-centric  attack  process  after  ignoring  the  sub¬ 
sequent  attacks  launched  by  the  same  attacker.  For  example,  we 
ignore  the  attack  launched  by  Attacker  1  at  time  £5  and  the  attack 
launched  by  Attacker  2  at  times  tg  and  tg,  etc.  That  is,  the  attacker¬ 
centric  attack  process  corresponding  to  the  victim  describes  the 
attacks  that  arrive  at  times 


Period 

(days) 

Mean(-) 

Median(-) 

Variance(  ■ ) 

MAX(-) 

LB 

UB 

LB 

UB 

LB 

UB 

LB 

UB 

1(47) 

30.217 

67.752 

4 

45 

1498.064 

4094.264 

225 

432 

11  (18) 

48.546 

100.774 

42 

93 

1195.136 

6298.324 

306 

621 

III  (54) 

11.059 

32.952 

2 

29 

223.638 

270.800 

64 

100 

IV  (21) 

1.938 

23.798 

1 

23 

26.322 

92.725 

40 

65 

V  (80) 

33.434 

127.944 

8 

105 

1132.646 

7465.228 

266 

605 

Table  VII:  Basic  statistics  of  attack  rate  of  the  attacker-centric 
processes  (per  hour). 


Figure  7  uses  boxplot  to  expose  more  information  (beyond 
the  lower-bounds  and  upper-bounds  mentioned  above)  about 
the  attacker-centric  attack  processes  in  terms  of  the  same 
statistics.  It  further  shows  that  the  attackers’  behaviors  are 
actually  very  different  in  all  five  periods.  In  particular,  it 
shows  that  Period  II  has  many  outliers  in  all  four  statistics, 
meaning  that  the  attack  rate  during  this  period  varies  a  lot. 

In  order  to  see  whether  the  attacker-centric  processes 
still  exhibit  the  LRD  phenomenon,  we  describe  their  Hurst 
parameters  in  Table  VIII.  Using  Period  I  as  an  example,  we 
observe  that  there  are  153  (out  of  the  166)  victims  whose 
corresponding  attacker-centric  processes  exhibit  the  LRD 
phenomenon  because  their  average  Hurst  parameter  £  [.6, 1], 
where  the  average  is  taken  over  the  six  Hurst  parameter 
methods.  Moreover,  none  of  the  153  attacker-centric  pro¬ 
cesses  exhibit  the  illegitimate  LRD  phenomenon  caused  by 
non-stationarity  of  the  processes.  Using  Period  V  as  another 
example,  we  observe  that  all  166  victim  whose  attacker¬ 
centric  processes  have  average  Hurst  parameter  £  [-6,1]. 
However,  there  are  only  77  victims  whose  attacker-centric 
exhibit  the  LRD  phenomenon,  and  the  other  89  processes 
exhibit  the  illegitimate  LRD  phenomenon  (possibly  caused 
by  non-stationarity  of  processes).  The  above  discussion 
suggests  that  the  LRD  phenomenon  exhibited  by  the  attack 
processes  is  probably  not  caused  by  the  intense  (consecutive) 
attacks  of  the  individual  attackers. 
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Figure  7:  Boxplot  of  the  four  statistics  of  attack  rate  of  the  attacker¬ 
centric  processes  (per  hour). 


C.  Summary 


First,  we  have  confirmed  that  LRD  indeed  can  be  caused 
by  the  non-stationarity  of  the  processes.  Second,  we  have 
found  that,  in  the  cyber  security  domain  as  exhibited  by 
the  honeypot-captured  data  we  analyzed,  LRD  is  probably 
not  caused  by  that  the  underlying  port-centric  attack  sub¬ 
processes  exhibit  the  heavy-tail  phenomenon.  This  is  in 
contrast  to  the  domain  of  benign  traffic,  where  is  it  believed 
that  LRD  was  probably  caused  by  the  superposition  of  the 
underlying  processes  that  exhibit  the  heavy-tail  phenomenon 
[1].  Third,  the  LRD  exhibited  by  the  attack  processes  is 
probably  not  caused  by  the  intensive  (consecutive)  attacks 
that  are  launched  by  the  individual  attackers. 

As  such,  beyond  that  non-stationarity  of  processes  can 
cause  the  LRD  phenomenon  in  the  cyber  security  domain, 
the  fundamental  of  LRD  in  stationary  processes  remain 
mysterious.  This  calls  for  more  future  studies,  which  could 
be  rewarded  with  even  better  models  for  predicting  incoming 
cyber  attacks. 
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Period 

(day) 

RS 

AGV 

Peng 

Per 

Box 

Wave 

#  of  victims  w/ 

h  e  [.6,  i] 

#  of  victims  w/ 
LRD 

min 

max 

min 

max 

min 

max 

min 

max 

min 

max 

min 

max 

1(47) 

0.593 

0.977 

0.851 

0.958 

0.896 

1.111 

1.174 

1.334 

0.942 

1.185 

0.582 

0.843 

153 

153 

II  (18) 

0.570 

0.883 

0.616 

0.950 

0.689 

1.070 

0.710 

1.152 

0.663 

1.242 

-0.360 

0.728 

92 

77 

III  (54) 

0.776 

0.994 

0.364 

0.747 

0.630 

0.748 

0.460 

0.679 

0.608 

0.746 

0.389 

0.668 

163 

103 

IV  (21) 

0.657 

0.920 

0.273 

0.955 

0.690 

0.872 

0.559 

1.206 

0.612 

0.952 

0.288 

1.004 

166 

165 

V  (80) 

0.495 

0.758 

0.563 

0.727 

0.499 

0.806 

0.898 

1.114 

0.660 

0.977 

0.567 

0.931 

166 

77 

Table  VIII:  The  estimated  Hurst  parameters  of  the  attack  rate  of  attacker-centric  processes  (per  hour).  The  six  estimation  methods  are 
reviewed  in  the  Appendix.  Note  that  a  Hurst  value  being  negative  or  being  greater  than  1  means  that  either  the  estimation  method  is  not 
suitable  or  the  data  (process)  is  non-stationary.  The  column  “#  of  victims  w I  H  £  [.6, 1]”  represents  the  total  number  of  attacker-centric 
processes  whose  average  Hurst  parameters  (where  average  is  among  the  six  kinds  of  Hurst  parameters)  £  [.6, 1],  which  indicates  the 
presence  of  the  LRD  phenomenon.  However,  some  LRD  phenomenon  may  be  caused  by  the  non-stationarity  of  the  processes  and  should  be 
eliminated.  This  is  shown  in  the  column  “#  of  victims  w /  LRD.”  which  indicates  the  total  number  of  attacker-centric  processes  exhibiting 
the  LRD  phenomenon  that  is  not  caused  by  the  non-stationarity  of  the  processes. 


VII.  LRD  Exhibited  at  a  More  Macroscopic  Level 

In  the  above  sections,  we  investigated  how  we  identified 
the  LRD  phenomenon  as  exhibited  by  the  attack  processes 
corresponding  to  individual  victims,  how  we  could  explore 
LRD  for  better  prediction,  and  how  we  exclude  some  possi¬ 
ble  causes  of  the  LRD.  In  this  section,  we  use  the  same  sta¬ 
tistical  methodology  to  study  the  statistical  properties  of  the 
same  attack  data,  but  from  a  more  macroscopic  perspective, 
namely  by  treating  the  166  victim  honeypot  IP  addresses  as 
a  network.  This  perspective  is  equally,  if  not  more,  relevant 
to  real-life  defense  operations  because  defenders  often  look 
attacks  from  an  entire-network  perspective. 

At  this  more  macroscopic  perspective,  we  can  easily 
expose  things  because  of  the  aggregate  effect.  For  example, 
it  is  feasible  now  to  plot  the  time  series  of  the  attack  rate 
(per  hour),  namely  the  total  number  of  attacks  against  the 
entire  166  victim  space.  Figure  8  plots  the  time  series  of 
attacks  in  the  five  periods  of  time.  We  make  the  following 
observations.  First,  there  are  some  extremely  intense  attacks 
during  some  hours  some  Periods  III  and  IV.  The  specific 
hour  corresponding  to  the  extreme  value  in  Period  III  is 
Apr  01,  2011,  12  Noon  (US  Eastern  Time);  the  attacks  are 
against  the  SSH  services.  It  is  evident  that  the  attacks  are 
brute-forcing  password.  The  peak  of  attacks  during  Period 
IV  occurred  at  May  16,  2011,  3  AM  (US  Eastern  time).  The 
intense  attacks  were  against  the  HTTP  service.  We  tried  to 
find  from  the  Internet  whether  or  not  the  peaks  correspond 
to  (for  example)  the  outbreak  of  worm  or  botnet,  but  no 
decisive  conclusion  can  be  drawn.  Second,  the  five  periods 
exhibit  different  attack  patterns.  For  example.  Periods  I, 
II  and  V  are  relatively  stationary.  Although  the  five  plots 
exhibit  some  change-points,  a  formal  statistical  analysis 
(using  the  method  for  removing  illegitimate  LRD,  which  was 
reviewed  in  Section  II)  shows  that  there  are  some  change- 
points  in  Period  III,  which  correspond  to  the  largest  attack 
rates. 


Period  I 


(a)  Period  I 


Period  III 


(c)  Period  III 


Period  II 


(b)  Period  II 


Period  IV 

(d)  Period  IV 


0  500  1000  1500 

Period  V 


(e)  Period  V 

Figure  8:  Time  series  plots  of  the  five  periods.  The  five  periods 
exhibit  different  patterns.  The  a>axis  is  indicates  the  relative  time 
with  respect  to  the  start  time  for  each  period. 


A.  Identifying  the  LRD  Phenomenon 

Table  IX  describes  the  basic  statistics  of  the  attack  rate  as 
exhibited  by  the  attack  processes  corresponding  to  the  victim 
network  during  the  five  periods  of  time.  We  observe  that 
on  average,  the  victim  network  is  least  intensively  attacked 
during  Period  IV  because  the  average  hourly  attack  rate  is 
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about  9861,  which  is  substantially  smaller  than  the  average 
attack  rate  during  the  other  periods. 


Period  (days) 

MIN 

Mean 

Median 

Variance 

Max 

1(47) 

2572 

30963.180 

28263 

401243263.217 

151189 

II  (18) 

5155 

31576.838 

29594 

167872818.950 

98527 

III  (54) 

6732 

20382.262 

19579 

72436071.490 

196210 

IV  (21) 

637 

9861.132 

6528 

93209085.323 

89718 

V  (80) 

1417 

18960.215 

15248.5 

205276388.408 

120221 

Table  IX:  The  basic  statistics  of  attack  rate  corresponding  to  the 


victim  network  during  the  five  periods  of  time. 

Table  X  describes  the  six  kinds  of  Hurst  parameters 
corresponding  to  the  five  attack  super-processes.  Although 
the  Hurst  parameters  suggests  that  they  all  exhibit  the 
LRD  phenomenon,  a  further  analysis  showed  the  the  LRD 
phenomenon  exhibited  by  Period  III  is  illegitimate,  meaning 
that  it  is  caused  by  the  non-stationarity  of  the  process. 


Period 

RS 

AGV 

Peng 

Per 

Box 

Wave 

Legitimate  LRD? 

I 

0.795 

0.945 

0.877 

1.031 

0.996 

0.754 

Yes 

II 

0.737 

0.594 

0.855 

0.746 

0.967 

0.842 

Yes 

III 

0.740 

0.522 

0.646 

0.626 

0.628 

0.645 

No 

IV 

1.053 

0.971 

0.949 

1.072 

0.967 

1.216 

Yes 

V 

0.741 

0.781 

0.739 

1.026 

0.797 

0.795 

Yes 

Table  X:  The  estimated  Hurst  parameters  for  attack  rate  corre¬ 


sponding  to  the  victim  network  during  the  five  periods  of  time. 


that  LRD-aware  FARIMA  can  effectively  predict  the  attack 
rate  pretty  accurately  even  five  hours  ahead.  This  would  give 
the  defender  enough  early-warning  time. 

Second,  Period  III  does  not  exhibit  LRD.  However,  both 
the  LRD-aware  FARIMA  and  the  LRD-less  ARMA  models 
can  predict  incoming  attacks  up  to  five  hours  ahead.  Indeed, 
the  prediction  accuracy  of  FARIMA  is  slightly  smaller  than 
the  accuracy  of  ARMA.  This  also  confirms  that  if  an  attack 
process  does  not  exhibit  LRD,  it  is  better  not  to  use  LRD- 
aware  prediction  models;  if  an  attack  process  does  exhibit 
LRD,  it  can  be  better  to  use  LRD-aware  prediction  models. 

Third,  although  Period  IV  exhibited  LRD,  even  its  one- 
hour  ahead  FARIMA  prediction  is  not  good  enough,  with 
over  50%  error.  While  it  is  unclear  what  caused  this  fact, 
we  note  that  the  underestimation  error  for  five  hour  ahead 
prediction  is  still  reasonable  for  Period  IV  (22.6%  for 
PMAD^^jj^  and  PMAD'lfiMyl).  This  means  that  if  one 
is  willing  to  over-provisioning  defense  resource  to  some 
extent,  then  the  prediction  for  Period  IV  is  perhaps  still 
useful. 

Fourth,  Period  V  resists  both  prediction  models  in  terms 
of  both  overall  accuracy  and  underestimation  accuracy.  The 
fundamental  cause  of  this  is  unknown,  and  left  for  future 
studies.  Nevertheless,  we  suspect  that  the  Extreme  Value 
Theorem  could  be  exploited  to  solve  this  problem. 


B.  Exploiting  LRD  for  Predicting  Incoming  Attacks 

Now  we  report  how  we  can  exploit  the  LRD  phenomenon 
to  enhance  the  accuracy  of  predicting  the  rates  of  attacks 
against  the  victim  network. 

Table  XI  describes  the  prediction  accuracy  when  treating 
the  166  victim  IP  addresses  as  a  victim  network.  We  observe 
the  following.  First,  for  Period  I  and  II,  both  one-hour 
ahead  and  five-hour  ahead  FARIMA  predictions  are  pretty 
accurate,  with  error  no  greater  than  22%.  However,  the  ten- 
hour  ahead  FARIMA  prediction  is  pretty  bad.  This  means 


C.  Exploring  Cause  of  LRD 

We  also  wanted  to  know  whether  or  not  the  legiti¬ 
mate  LRD  phenomenon  exhibited  by  the  four  attack  super¬ 
processes  against  the  victim  network  during  Periods  I,  II, 
IV  and  V  are  caused  by  the  superposition  of  underlying  pro¬ 
cesses  that  exhibited  the  heavy-tail  phenomenon.  That  is,  we 
want  to  know  how  many  attack  processes  —  among  the  166 
attack  processes  corresponding  to  the  victims  during  each 
of  the  four  periods  —  exhibit  the  heavy-tail  phenomenon. 
Using  the  methodology  as  in  Section  VI,  we  found  that 
among  the  vector  of  (166, 166, 166, 166)  attack  processes 


Period  | 

PMADfarima 

PMADarma 

PMADpARIMA 

PMADarma 

One-hour  ahead  prediction  (p  —  I 

D.5) 

I 

0.1790597 

0.4457219 

0.1727588 

0.1570264 

II 

0.2171558 

0.3629267 

0.1488566 

0.1493358 

III 

0.2981511 

0.2729375 

0.3048926 

0.3119207 

IV 

0.5483212 

0.5256603 

0.1259078 

0.1065005 

V 

0.5165464 

0.5285076 

0.4238902 

0.4105138 

Five-hour  ahead  prediction  ( p  =  1 

0.5) 

I 

0.2061686 

0.5563662 

0.2920678 

0.3137478 

II 

0.2121247 

0.3511250 

0.4196604 

0.4108466 

III 

0.2968155 

0.2723984 

0.2459617 

0.2496851 

IV 

0.8465184 

0.8381369 

0.2263648 

0.2067392 

V 

0.5262515 

0.5552034 

0.4143909 

0.4171769 

Ten-hour  ahead  prediction  ( p  =  ( 

3.5) 

I 

0.8685109 

0.8011304 

0.3142247 

0.2812013 

II 

1.0237179 

1.0336110 

0.2768460 

0.2840824 

III 

1.0044999 

1.0019792 

0.2015967 

0.2013436 

IV 

0.6477370 

0.6271771 

0.2816179 

0.4896072 

V 

0.9815581 

0.9519172 

0.4017764 

0.4117593 

Table  XI:  Accuracy  of  predicting  the  rate  of  incoming  attacks 
against  the  victim  network  using  the  LRD-aware  FARIMA  and  the 


]during  Period  I,  II,  IV  and  V,  the  vector  of  processes 
-.that  exhibit  the  heavy-tail  phenomenon  is  correspondingly 
'(101,0,24,31).  This  means  that  Period  I  is  the  only  period 
-during  which  a  majority  of  processes  exhibited  the  heavy- 
tail  phenomenon.  A  few  or  even  none  processes  in  the 
three  other  periods  exhibited  the  heavy-tail  phenomenon. 
-This,  again,  suggests  that  the  LRD  phenomenon  exhibited 
(by  the  honeypot-captured  data  at  a  more  macroscopic  level 
-probably  does  not  share  the  cause  that  is  believed  for  benign 
traffic  [1], 

-D.  Summary 

By  looking  at  the  same  attack  data  but  from  a  more 
"macroscopic  perspective,  we  can  draw  similar  findings  as 
to  the  ones  obtained  when  we  look  at  the  attack  data  from  a 


LRD-less  ARMA  models  and  the  measures  specified  in  Section  II. 
Basically,  the  smaller  the  value  PMADfarima  (PMADarma),  the 
more  accurate  the  FARIMA  (ARMA)  prediction,  p  =  0.5  means 
that  we  start  predicting  in  the  midpoint  of  the  length  of  each  time 
period. 


more  microscopic  perspective.  Specifically,  among  the  five 
attack  super-processes  corresponding  to  the  victim  network 
during  five  periods,  LRD  was  exhibited  by  four  of  them. 
We  also  found  that  LRD-aware  models  can  —  but  not 
necessarily  always  —  predict  the  incoming  attacks  more 
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accurately  than  LRD-less  models,  and  that  LRD-less  models 
are  more  appropriate  for  prediction  in  dealing  with  processes 
that  do  not  exhibit  LRD.  Finally,  we  re-affirmed  that  LRD 
as  exhibited  by  the  honeypot-captured  data  is  probably  not 
caused  by  the  superposition  of  attack  processes  that  exhibit 
the  heavy-tail  phenomenon. 

VIII.  Limitations 

This  study  has  some  limitations. 

Data  size 

The  dataset,  albeit  over  47  +  18  +  54  +  21  +  80  =  220 
days  in  total  (five  periods  of  time),  only  corresponds  to 
166  honeypot  IP  addresses.  We  wish  to  have  access  to 
significantly  larger  dataset  of  this  kind.  Nevertheless,  it 
is  notoriously  difficult  to  get  such  data  for  various  rea¬ 
sons.  For  example,  it  appears  that  even  the  PREDICT 
project  (https://www.predict.org/)  does  not  offer 
this  kind  of  data.  Still,  this  paper  touches  an  important 
direction  for  cyber  security  research  because  of  the  potential 
reward  in  understanding  the  statistical  properties  of  cyber 
attacks  and  in  possibly  predicting  the  incoming  attacks  with 
reasonable  accuracy.  The  statistical  analysis  methodology 
used  in  the  present  paper  would  be  useful  to  researchers 
who  have  access  to  larger  datasets  of  this  kind. 

Low-interaction  vs.  high-interaction  honeypot 

The  dataset  is  attack-agnostic  in  the  sense  that  we  know  the 
ports/services  the  attackers  attempt  to  attack,  but  to  a  large 
extent  we  do  not  know  the  specific  attack  details.  This  is 
mainly  attributed  to  that  the  data  was  collected  using  low- 
interactive  honeypots.  Although  this  issue  can  be  resolved 
with  data  that  is  collected  using  high-interactive  honeypots, 
we  are  fully  aware  of  the  legitimate  concerns  about  such 
studies  from  a  legal  perspective.  Nevertheless,  our  method¬ 
ology  is  equally  applicable  to  the  analysis  of  cyber  attack 
data  that  is  captured  by  high-interactive  honeypots. 

Honeypot  vs.  production  network 

The  data  was  collected  using  honeypot  rather  than  using 
production  network.  For  real-life  adoption  of  the  prediction 
capability  presented  in  the  paper,  attack  traffic  would  be 
blended  into  the  production  traffic.  Whether  or  not  the 
blended  traffic  also  exhibits  the  LRD  phenomenon  is  an 
interesting  future  study  topic.  The  main  challenge  here  is 
to  set  up  such  a  data  collection  instrument,  which  natu¬ 
rally  incurs  legitimate  non-technical  (e.g.,  legal  or  privacy) 
concerns.  Indeed,  we  do  not  have  access  to  such  data.  (It 
may  not  be  a  good  idea  to  simply  blend  the  honeypot  traffic 
with  benign  production  traffic  because  this  might  disrupt  the 
attack  process  structure.) 

IX.  Conclusion  and  Future  Work 

We  introduced  the  concept  of  stochastic  cyber  attack 
processes,  which  offers  a  new  perspective  for  studying  cyber 


attacks.  We  also  proposed  a  statistical  methodology  for  ana¬ 
lyzing  the  processes.  By  applying  the  methodology  to  some 
honeypot-collected  attack  data,  we  found  that  majority  of  the 
attack  processes  exhibit  the  LRD  (Long-Range  Dependence) 
phenomenon.  We  demonstrated  that  by  using  LRD-aware 
models  we  can  predict  the  attack  rates  five-hours  ahead, 
especially  when  aggregating  the  victims  into  a  network. 
This  hints  that  attacks  against  enterprise-level  networks  are 
probably  more  predictable  than  attacks  against  individual 
computers.  The  power  comes  from  “gray-box”  (rather  than 
“black-box”)  prediction. 

Future  research 

The  present  study  introduces  a  range  of  interesting  problems 
for  future  research.  First,  we  need  to  further  improve  the 
prediction  accuracy,  even  though  our  study  showed  that  the 
FARIMA  model  can  predict  the  attack  process  fairly  well 
(especially  when  aggregating  the  victims  into  a  network). 
For  this  purpose,  we  plan  to  use  some  advanced  models 
with  high  volatilities. 

Second,  we  hope  to  rigorously  explain  the  fundamental 
cause  of  the  LRD  phenomenon  as  exhibited  by  the  honeypot- 
captured  attack  data.  As  shown  in  the  paper,  the  cause  of 
LRD  is  mysterious,  although  we  know  that  it  is  probably 
not  caused  by  the  superposition  of  processes  that  exhibit  the 
heavy-tail  phenomenon,  and  that  it  is  probably  not  caused 
by  the  intense  attacks  launched  by  individual  attackers. 

Third,  we  observed  that  the  attacks  at  the  scale  of  individ¬ 
ual  IP  addresses  (as  investigated  in  Sections  IV-VI)  and  the 
attacks  at  the  scale  of  victim  networks  (i.e.,  by  aggregating 
the  166  victim  IP  addresses  into  a  network  as  shown  in 
Section  VII)  exhibit  similar  phenomena.  This  hints  a  sort 
of  scale-invariance  that,  if  turns  out  to  hold,  would  have 
extremely  important  implications  (for  example)  in  achieving 
scalable  analysis  of  cyber  attacks.  In  order  to  resolve  this 
issue,  larger  datasets  are  needed  albeit  extremely  hard  to 
obtain. 
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Appendix 

A.  Methods  for  Estimating  Hurst  Parameter 

We  use  seven  major  Hurst  parameter  estimation  methods 
for  evaluating  LRD  [22].  Note  that  one  should  not  just  report 
Hurst  value  based  on  one  method,  but  should  check  several 
Hurst  values  based  on  different  estimation  methods  [41], 
[42], 

1)  RS  method.  For  a  time  series  {Xt,  t  >  1},  with 
partial  sum  Yt  =  X,,  and  sample  variance 

S2  =  j  E'=1  Xf  —  (yj)2  Yf,  the  R/S  statistic  is  defined 
as 


4)  Per  (Periodogram)  Method.  One  first  calculates 


/(A) 


1 

2t tN 


N 


where  A  is  the  frequency,  N  is  the  number  of  terms  in 
the  series,  and  Xt  is  the  data.  A  series  with  LRD  should 
have  a  periodogram,  which  is  proportional  to  A1  211  for 
A  «  0.  Therefore,  a  regression  of  the  logarithm  of  the 
periodogram  on  the  logarithm  of  the  frequency  gives 
coefficient  1  —  2  H. 

5)  Box  (Boxed  Periodogram)  method.  The  boxed  peri¬ 
odogram  method  was  developed  specifically  to  deal 
with  the  problem  of  having  most  of  the  points,  which 
are  used  to  estimate  H,  on  the  right-hand  side  of  the 
graph. 

6)  Wave  (Wavelet)  method.  Wavelets  can  be  thought  of  as 
akin  to  Fourier  series  but  using  waveforms  other  than 
sine  waves.  The  estimator  used  here  fits  a  straight  line 
to  a  frequency  spectrum  derived  using  wavelets  [24]. 


B.  Heavy-tail  Distributions 

A  random  variable  X  is  said  to  belong  to  the  Maximum 
Domain  of  Attraction  (MDA)  of  the  extreme  value  distribu¬ 
tion  He  if  there  exists  constants  cn  £  R+,  dn  £  K.  such  that 
its  distribution  function  F  that  satisfies 


R 

Js 


(n) 


max 

0<£<n 


min 

0<£<n 


For  LRD  series,  we  have 


E 


CnnH , 


as  n  — >  oo,  where  Ch  is  a  positive,  finite  constant 
independent  of  n. 

2)  AGV  (Aggregated  variance)  method.  Divide  time  series 
{Xt,t  >  1}  into  blocks  of  size  to.  The  average  within 
a  block  is 


km 

I(m)W  =  -  V  Xi,  k=  1,2.... 

i=(k-l)m-\-l 


Take  the  sample  variance  of  X^m\k),  k  =  1,2,... 
within  each  block.  The  sample  variance  is  an  estimator 
of  Var(2f(m)).  For  LRD  series,  we  have 

Var  ^X^™^  ~  cm_/3,  m  — >  oo, 

where  /?  =  2H  —  2,  and  c  is  a  finite  positive  constant 
independent  of  to. 

3)  Peng  Method.  The  series  is  broken  up  into  blocks  of 
size  to.  Compute  partial  sums  Y(i),  i  =  1,2..., to. 
within  blocks.  Fit  a  least-square  line  to  the  Y(i)’s 
and  compute  the  sample  variance  of  the  residuals. 
This  procedure  is  repeated  for  each  of  the  blocks,  and 
the  resulting  sample  variances  are  averaged.  It  can  be 
shown  that  the  resulting  number  is  proportional  to  m2H 
for  LRD  series. 


linr  nF(cnx  +  dn )  =  HXx). 

n—f  oo  s 


It  is  known  that  for  £  =  1/a  >  0,  F  £  MDA(iTc)  is 
equivalent  to  the  distribution  function  satisfies 


F(x)  =  1  —  x  aL{x), 


where  L(x)  is  a  slow  varying  function  L  (cf.  [25]).  In 
statistics,  X  is  said  to  follow  a  heavy-tailed  distribution 
if  F  £  MDA(TT^).  Note  that  [1]  if  X  has  a  heavy-tailed 
distribution,  then  its  tail 


F(cy) 

F(c) 


y  a,  y  >  0,  c  — >■  oo, 


where  F(c)  =  1  —  F(c).  There  are  many  methods  for 
estimating  the  parameter  a  [25],  [1].  In  the  following,  we 
briefly  review  the  widely-used  method,  called  Point  Over 
Threshold  (POT). 

Let  Xi, . . . ,  Xn  be  independent  and  identically  distributed 
random  variables  from  F  £  MDA  (/A),  then  we  may  choose 
a  high  threshold  u  such  that 


lim  sup  \Fu(x)  —  Gji/3(„)(a;)|  =  0, 

U—fXF  0 <X<Xf~U 

where  xp  is  the  right  end  poind  point  of  X ,  and 


Fu(x)  =  P{X  —  u  <  x\X  >  u),  x  >  0, 

and  =  1  —  G^  ^^)  is  the  survival  function  of 

generalized  Pareto  distribution  (GPD) 

%W(*)=(  (1  +  ^)  ’  ^°’  , 

{  exp{— x//3},  $  =  0. 
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where  x  €  R+  if  £  €  M+,  and  x  €  [0, —/?/£]  if  £  €  R_. 

£  and  /3  are  called  shape  and  scale  parameters.  The  POT 
method  states  that  if  Xi, ,  Xn  are  heavy-tailed  data,  then 
[Xi  —  u\Xi  >  u\  follows  a  generalized  Pareto  distribution. 


C.  Goodness-of-fit  Test  Statistics 

In  statistics,  the  goodness-of-fit  of  a  distribution  describes 
how  well  a  distribution  fits  a  set  of  observations.  We 
use  three  commonly  used  goodness-of-fit  test  statistics: 
Kolmogorov-Smirnov  (KS),  Anderson-Darling  (AD)  and 
Cramer-von  Mises  (CM). 

Let  Xi, . . .  ,Xn  be  independent  and  identical  random 
variables  with  distribution  F.  The  empirical  distribution  Fn 
is  defined  as 

1  n 

Fn{x)  =  -  I  (Xi  <  x), 
n 

i=i 


where  I(X,  <  x)  is  the  indicator  function: 


I  (Xi  <  x) 


1  ■  Xi  '  X, 
0,  o/w. 


The  KS  test  statistic  is  defined  as 


KS  =  -v/n-sup  | Fn(x)  —  F(x) \ , 

X 

and  the  CM  test  statistic  is  defined  as 

CM  =  n  J  (. Fn{x )  —  F(x))2dF(x). 

The  AD  test  statistic  is  defined  as 

AD  =  „/(F„W-rW)2«.W<irW, 

where  w(x)  =  [T1(x)(l  —  ^(x))]”1.  Note  that  AD  test 
is  a  variant  of  CM  test  by  giving  more  weights  to  the 
observations  in  the  tail  of  the  distribution,  which  is  useful  in 
detecting  outliers.  Based  on  samples,  the  CM  and  AD  test 
statistics  can  be  computed  as  follows. 

•  Estimate  the  unknown  parameter(s)  for  distribution  F, 
and  make  the  transform  =  F(x^),  for  i  = 

1 n,  where  xi  <  X(2)  <  . . .  <  X(n). 

•  Calculate  CM  and  AD  as  follows  [36],  [37]: 

n 

CM  =  £ 

i- 1 

and 

1  " 

AD  =  2*-l)  [log(w(*))  -  log(l  -  u[n+1_i})]  . 


u(i) 


2i-  1 
2  n 
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Abstract.  Botnets  have  become  a  major  threat  in  cyberspace.  In  or¬ 
der  to  effectively  combat  botnets,  we  need  to  understand  a  botnet’s 
Command-and-Control  (C&C),  which  is  challenging  because  C&C  strate¬ 
gies  and  methods  evolve  rapidly.  Very  recently,  botmasters  have  begun 
to  exploit  social  network  websites  (e.g.,  Twitter.com)  as  their  C&C  in¬ 
frastructures,  which  turns  out  to  be  quite  stealthy  because  it  is  hard  to 
distinguish  the  C&C  activities  from  the  normal  social  networking  traffic. 
In  this  paper,  we  study  the  problem  of  using  social  networks  as  botnet 
C&C  infrastructures.  Treating  as  a  starting  point  the  current  generation 
of  social  network-based  botnet  C&C,  we  envision  the  evolution  of  such 
C&C  methods  and  explore  social  networks-based  countermeasures. 
Keywords:  Botnet,  command-and-control,  social  networks,  security. 


1  Introduction 

The  critical  difference  between  botnets  and  other  malware  is  that  botmasters 
use  a  Connnand-and-Control  (C&C)  to  coordinate  large  numbers  of  individual 
bots  (i.e. ,  compromised  computers)  to  launch  potentially  much  more  damaging 
attacks.  Botmasters  also  evolve  their  C&C  strategies  and  methods  very  rapidly 
to  evade  defenders’  countermeasures.  Therefore,  from  a  defender’s  perspective, 
it  is  always  important  to  understand  the  trends  and  practices  of  botnet  C&C  [8, 
15,19,23,29,11,26].  Previous  studies  have  mainly  focused  on  two  approaches: 
host-centric  [31, 36]  and  network- centric  [16, 18, 17, 3,  7,  8, 14, 23, 20, 6].  The  host¬ 
centric  approach  aims  to  detect  suspicious  host  activities,  such  as  the  use  of 
incoming  network  data  as  system  call  arguments.  The  network-centric  approach 
attempts  to  detect  suspicious  network  activities  by  (for  example)  identifying 
network  traffic  patterns.  The  fact  that  a  social  network-based  botnet  C&C  on 
Twitter.com  was  detected  by  “digging  around”  [25]  suggests  that  we  need  to 
pursue  more  detection  approaches. 

Our  contributions.  Through  an  application- centric  approach,  we  study  the 
problem  of  botnets  that  use  social  network  websites  as  their  C&C  infrastruc¬ 
tures.  First,  we  characterize  the  current-generation  of  social  network-based  bot¬ 
net  C&C,  describing  their  strengths  and  weaknesses.  Our  characterization,  while 


inspired  by  [25],  is  broader  and  deeper.  Second,  we  envision  how  current  social 
network-based  botnet  C&C  might  evolve  in  the  near  future,  which  capitalize  on 
their  strengths  while  diminishing  their  weaknesses.  Third,  we  explore  counter¬ 
measures  for  dealing  with  both  current  and  future  generations  of  social  network- 
based  botnet  C&C.  Since  social  network  providers  as  well  as  client  machines 
are  victims  of  a  social  network-based  botnet  C&C,  both  server-side  and  client- 
side  countermeasures  are  demonstrated  and  tested  for  both  effectiveness  and 
performance.  Fourth,  we  discuss  the  limitations  of  the  application-centric  ap¬ 
proach  demonstrated  in  this  paper,  which  suggests  the  need  to  integrate  it  with 
the  aforementioned  host-centric  and  network-centric  methods  because  the  three 
approaches  are  complementary  to  each  other. 

Paper  outline.  Section  2  discusses  related  prior  work.  Section  3  presents  a 
characterization  of  the  current  generation  of  social  network-based  botnet  C&C. 
Section  4  envisions  the  next-generation  of  social  network-based  botnet  C&C. 
Sections  5  and  6  investigate  server-end  and  client-end  solutions  to  detecting 
social  network-based  botnet  C&C,  respectively.  Section  7  discusses  how  to  in¬ 
tegrate  them  and  how  they  can  benefit  from  host-centric  and  network-centric 
approaches.  Section  8  describes  future  work  and  concludes  the  paper. 

2  Related  Work 

Network-centric  approach.  This  approach  aims  to  detect  botnets  by  cor¬ 
relating  the  network  traffic  of  a  computer  population,  including  destination  IP 
addresses,  server  names,  packet  content,  event  sequences,  crowd  responses,  proto¬ 
col  graphs  and  spatial-temporal  relationships  [16, 18, 17, 3,  7,  8, 14, 23, 20, 6].  This 
approach  is  especially  useful  when  only  network  traffic  data  is  available. 
Host-centric  approach.  This  approach  aims  to  differentiate  malicious  from 
benign  processes  running  on  host  computers  by  observing  that  bot  processes 
often  use  data  received  from  the  network  as  parameters  in  system  calls  [31]. 
A  detection  technique  based  on  a  high  rate  of  failed  connection  attempts  to 
remote  hosts  was  recently  presented  in  [36] ,  which  does  not  necessarily  apply  to 
the  type  of  botnets  we  consider  in  the  present  paper  because  the  connections  are 
to  popular  social  networking  sites  and  are  generally  successful.  This  approach 
often  looks  deeply  into  the  software  stack  [36] . 

Application-centric  approach.  This  approach  looks  into  the  application- 
specific  interactions.  Previously,  the  focus  has  been  put  on  IRC-based  botnets 
(see,  e.g.,  [14]).  Recently,  the  possibility  of  exploiting  emails  as  a  botnet  C&C 
was  investigated  [30] ,  and  the  feasibility  of  detecting  such  botnets  through  their 
resulting  spam  traffic  was  presented  in  [35,34,37,22].  In  this  paper  we  consider 
a  specific  class  of  applications,  namely  web-based  social  networking.  The  con¬ 
cept  of  social  network-based  botnet  C&Cs  can  be  dated  back  to  2007  [21, 12, 28, 
13],  but  such  botnets  became  a  reality  only  very  recently  [2,25].  In  particular, 
[25]  served  as  the  starting  point  of  the  present  study.  It  should  be  noted  that 
the  focus  of  the  present  paper  is  only  remotely  related  to  the  abuse  of  social 
networks  for  other  purposes  [1]. 


3  Characterizing  Current  Social  Network-Based  Botnet 


How  Does  Current  Social  Network-Based  Botnet  C&C  Work?  Jose 
Narario  first  reported  the  actual  use  of  social  network  as  a  botnet  C&C  [25], 
although  the  concept  had  been  proposed  as  early  as  2007  [21, 28].  We  call  such  a 
bot  Naz,  after  its  discoverer.  At  a  high-level,  Naz’s  botnet  used  accounts  with  the 
name  upd4t3  owned  by  the  botmaster  on  social  network  sites  Twitter.com  and 
Jaiku.com.  These  bots  received  Base64-encoded  commands  from  these  accounts 
to  download  malicious  payloads  onto  the  victimized  bot  computers.  Since  then, 
other  C&Cs  were  discovered  with  variations  of  the  botnet’s  scheme,  such  as  the 
Twitter.com  account  Botn3tControl,  which  was  shutdown  days  later. 


Fig.  1.  Naz’s  C&C  attack  flow  (left)  and  control  flow  (right) 


To  understand  the  behavior  of  Naz’s  botnet,  we  conducted  two  experiments. 
The  first  confirmed  and  extended  Naz’s  C&C  flow  reported  in  [25].  Specifically, 
we  obtained  and  ran  a  Naz  sample  on  a  machine  by  replacing  its  references  to 
Twitter.com  with  a  server  under  our  control.  This  was  necessary  because  the 
Twitter.com  account  was  shutdown  by  Twitter. corn’s  administrators  shortly 
after  its  detection.  We  used  our  own  Base64-encoded  messages  with  a  bit.ly 
URL  we  set  up  that  redirected  to  the  payload  stored  on  our  server.  The  payload 
was  a  Base64-encoded,  compressed  archive  containing  two  files:  gbpm.exe  and 
gbpm .  dll.  These  files  were  the  originally  identified  payload  package  of  Naz.  In  our 
analysis,  we  reconstructed  the  original  C&C  flows  described  below  and  depicted 
in  the  left-hand  side  of  Figure  1. 

1.  The  bot  makes  a  HTTP  GET  request  to  upd4t3’s  RSS  (Really  Simple  Syndi¬ 
cation)  feed  at  Twitter.com. 

2.  Twitter.com  returns  the  RSS  feed,  containing  Base64-encoded  text  in  the 
description  nodes. 

3.  The  bot  decodes  the  Base64-encoded  text,  revealing  one  or  more  bit.ly 
URLs,  and  makes  a  HTTP  GET  request  to  each.  The  bit.ly  website  provides 
short  aliases  for  long  URLs. 

4.  Each  bit.ly  URL  redirects  to  a  malicious  zip  file  hosted  on  an  independent 
attack  server. 


5.  The  bot  downloads  the  malicious  zip  file  as  a  payload. 

6.  The  bot  decodes  and  unzips  the  payload,  replicates  itself  and  the  payload’s 
uncompressed  files,  and  then  executes  the  payload’s  contents. 

7.  The  payload  attempts  to  gather  user  information  from  the  victim  computer 
and  send  it  to  a  server  selected  by  the  botmaster. 

To  have  a  deeper  understanding  of  the  internal  control  flow  of  Naz,  we  con¬ 
ducted  further  black-box  testing  using  data  provided  by  Network  Monitor  [9] 
and  CWSandbox  [10],  from  which  we  draw  the  control  flow  details  on  the  right- 
hand  side  of  Figure  1.  We  observed  that  the  bot  made  a  copy  of  itself  and  the 
two  files  mentioned  above  in  a  temporary  directory,  and  that  when  executing 
gbpm.  exe,  the  bot  dynamically  injected  code  into  gbpm.  exe’s  process.  Moreover, 
we  observed  that  Naz  handled  unexpected  inputs  as  follows: 

1.  When  we  provided  a  URL  to  a  bogus  RSS  feed,  the  bot  failed  to  connect  and 
instead  attempted  to  access  an  RSS  feed  from  a  Jaiku.com  account.  This  ac¬ 
count  name  was  hardwired  in  the  bot  program  and  had  been  deactivated  by 
Jaiku. corn’s  administrator.  This  second  connection  failure  led  the  bot  to  issue 
a  DNS  query  on  the  domain  name  mansvelt .  freehostia.  com,  after  which 
the  bot  stopped  producing  network  traffic.  The  site  mansvelt .  freehostia.  com 
is  currently  unregistered  and  has  no  IP  address. 

2.  When  we  placed  plaintext  sentences  and  URLs  in  our  in-house  RSS  feed,  the 
bot  read  the  RSS  feeds  but  did  not  show  evidence  of  decoding  and  using  the 
text  in  connection  attempts. 

3.  When  we  modified  the  payload  file  to  Base64-encoded  only,  compressed  only, 
and  replaced  it  with  an  executable  file,  the  bot  did  not  attempt  to  unzip  or 
execute  the  file’s  contents.  When  we  renamed  the  two  payload  hies  gbpm.  exe 
and  gbpm.dll,  the  bot  did  not  attempt  to  execute  the  renamed  gbpm.exe 
hie,  implying  that  the  bot  was  program  name  sensitive. 

Finally,  it  is  interesting  to  note  that  a  dynamic  analysis  of  gbpm.dll  revealed 
that  the  payload  attempted  to  connect  to  a  bank  in  Brazil.  Moreover,  both  the 
analysis  in  [25]  and  our  independent  experiments  demonstrate  that  Naz’s  botnet 
C&C  serves  primarily  as  a  Trojan  downloader  [32]. 

Strengths  of  Naz’s  Botnet  C&C.  Naz’s  botnet  C&C  has  the  following  ad¬ 
vantages  when  compared  with  other  botnet  C&C  infrastructures  and  methods: 

1.  Abusing  trusted  popular  websites  as  a  C&C  server.  Social  networks 
and  Web  2.0  sites  such  as  Twitter.com,  FaceBook.com,  Linkedln.com,  and 
YouTube.com  are  not  only  legitimate,  with  verihable  SSL  or  EV-SSL  certifi¬ 
cates,  but  also  heavily  used  by  millions  of  users.  Due  to  this  heavy  usage, 
light  occasional  traffic  to  one  or  more  accounts  is  unlikely  to  be  noticed 
compared  to  a  user’s  actual  traffic  pattern.  This  avoids  any  unnecessary  and 
sometimes  suspicious  DNS  queries  (e.g.,  for  non-popular  DNS  names). 

2.  Exploiting  popular  port(s)  for  C&C  communication.  Port  80  is  the 
de  facto  standard  for  web  traffic,  and  most  network  traffic  will  how  through 
it.  This  helps  bots  blend  in  with  benign  traffic. 


3.  Abusing  application  features  for  automatic  C&C.  The  botmaster 
uses  application  features,  such  as  RSS  feeds,  to  automatically  update  bots. 
Moreover,  the  commands  are  so  light-weight  that  they  cannot  be  easily  dis¬ 
cerned  from  normal  social  network  traffic. 

The  above  discussion  demonstrates  that  botmasters  have  begun  to  exploit  “ hid¬ 
ing  in  plain  sight ”  to  conduct  stealthy  botnet  C&C.  By  piggybacking  on  the 
reputation  and  legitimacy  of  social  network  websites,  botnet  C&C  activities  may 
remain  hidden,  while  defeating  the  “many  eyes”  defense  [30]. 

Weaknesses  of  Naz’s  Botnet  C&C.  We  need  to  understand  the  weaknesses 
of  current  generation  of  social  network-based  botnet  C&Cs  because  these  weak¬ 
nesses  will  likely  be  absent  in  future  generations.  This  is  not  meant  to  help  the 
attackers,  rather  it  is  meant  to  help  the  defenders  look  ahead.  Our  examination 
shows  that  Naz’s  botnet  C&C  has  weaknesses,  which  are  omitted  due  to  space 
limitation. 


4  Envisioning  Future  Social  Network-Based  Botnet 

In  order  to  defeat  future  social  network-based  botnets,  we  must  think  ahead  of 
the  attackers.  For  this  purpose,  we  can  show  how  the  aforementioned  weaknesses 
of  the  current  generation  of  social  networks-based  botnet  C&Cs  can  be  avoided. 
Due  to  space  limitation,  details  are  omitted. 

5  Server-side  Countermeasures 

5.1  The  Detection  Mechanism 

A  key  observation  behind  our  detection  mechanism  is  that,  regardless  of  the 
channel,  provider,  or  account,  social  network  messages  are  in  text.  Thus,  if  bot¬ 
masters  want  to  use  social  networks  for  their  C&C,  they  would  encode  their 
commands  textually.  Moreover,  just  like  legitimate  messages  may  include  web 
links,  so  might  C&C  messages  (e.g.,  links  for  downloading  payload).  These  ob¬ 
servations  inspired  us  to  distinguish  between  encoded  and  plain  texts  and  to 
follow  unencoded  links  to  their  destination.  Our  detection  mechanism  can  be 
adopted  by  the  Webserver  as  shown  in  Figure  2,  and  the  resulting  system  would 
operate  as  follows  (with  steps  2  and  3  relevant  to  our  countermeasure): 

1.  Alice  logs  into  her  social  network  and  updates  her  status  display  using  a 
content  form. 

2.  The  social  network’s  content  updater  sends  the  text  content  to  our  server-end 
system. 

3.  The  detection  mechanism,  which  will  be  implemented  as  a  classifier  in  our 
prototype  system,  determines  if  the  text  is  suspicious  and  returns  a  result. 

4.  The  content  form  updates  the  database  with  the  message  and  whether  it 
was  marked  suspicious. 


5.  Bob  check’s  Alice  content  display,  either  through  a  feed  like  RSS  or  by  vis¬ 
iting  the  site. 

6.  The  content  display  requests  the  content  from  the  database. 

7.  The  database  returns  non-suspicious  content;  if  a  threshold  level  of  suspicious 
messages  (determined  by  policy)  has  been  reached,  the  database  returns  a 
“suspicious  account”  message. 

8.  The  content  display  shows  those  retrieved  messages  to  the  user,  or  a  “suspi¬ 
cious  account”  message  if  the  suspicious  message  threshold  has  been  reached. 


Bob  Social  Network 


Fig.  2.  Example  scenario  using  our  server-side  detection  mechanism 


The  server-side  detection  mechanism  has  the  following  advantages.  First,  it 
is  account  agnostic  because  it  looks  for  text  attributes  that  are  shared  with  en¬ 
coded  text  rather  than  individual  behavioral  patterns.  Second,  it  is  language 
agnostic  because  it  looks  at  text  for  attributes  that  are  shared  with  encoded 
text  rather  than  individual  words.  As  a  result,  the  detection  mechanism  is  effec¬ 
tive  for  any  language  using  Roman  characters  (English,  Spanish,  French,  etc.). 
Third,  it  is  easy  to  deploy  because  it  can  take  advantage  of  light-weight  machine 
learning  algorithms  and  thus  make  decisions  in  real-time.  Moreover,  the  code  is 
easy  to  deploy  as  a  library  or  software-as-a-service.  Fourth,  it  can  follow  unen¬ 
coded  links  to  determine  if  the  destination  is  a  trusted  source,  say  by  using  SSL 
authentication  as  a  trust  infrastructure.  In  the  next  two  subsections  we  analyze 
the  effectiveness  and  performance  of  our  approach. 


5.2  Prototype  Implementation  and  its  Effectiveness  and  Limitations 

Prototype  implementation.  To  demonstrate  the  effectiveness  of  our  system, 
we  instantiated  the  detection  mechanism  as  Weka’s  [33]  J48  decision  tree  algo¬ 
rithm  (because  it  is  quick  and  readily  usable,  but  other  tailored  algorithms  can 
be  used  instead  in  a  plug-and-play  fashion)  to  classify  input  messages  so  as  to 
distinguish  between  Base64-  or  Hexadecimal-encoded  text  and  natural  language 
text.  For  links  in  the  clear,  by  following  links  to  their  destination,  we  can  mark 
the  content  as  “suspicious”  if  it  is  an  atypical  file  (e.g.,  an  executable,  library, 


encoded,  or  compressed  file,  or  a  combination  of  these).  To  build  a  pool  of  “non- 
suspicious”  text,  we  screenscraped  200  Twitter.com  accounts  to  build  a  list  of 
4000  messages.  Our  pool  of  bot  commands  were  400  short  random  commands  of 
fifteen  to  thirty  characters  that  were  then  encrypted  using  RC4  stream  cipher 
and  then  encoded,  giving  a  10:1  set  of  normal  to  suspicious  text.  We  then  split 
the  messages  into  a  training  set  with  70%  of  both  types  and  a  test  set  with  the 
remaining  30%.  Recognizing  that  altering  the  natural  Base64  or  Hexadecimal 
alphabet  with  alternate  characters  such  as  spaces  or  punctuation  could  be  used 
to  obfuscate  the  text,  we  also  ran  our  classifier  against  such  alternate  encoding 
schemes. 

Effectiveness.  For  standard  Base64  and  Hexadecimal  encoding  schemes,  our 
classifier  was  able  to  quickly  distinguish  between  our  “normal”  and  “suspicious” 
text  samples  in  an  account-agnostic  way,  no  false  positives  and  no  false  negatives, 
for  both  Base64  and  Hexadecimal  encoding  (see  Table  1).  Moreover,  our  classifier 
maintained  this  accuracy  even  when  the  commands  were  obfuscated  with  other 
words — the  distinctiveness  of  the  encoded  commands  was  readily  apparent.  The 
results  were  so  perfect  because  the  attributes  we  used—  number  of  spaces  in 
the  text,  longest  word,  and  shortest  word — cleanly  divided  the  “normal”  and 
“suspicious”  text.  To  produce  non-standard  Base64  and  Hexadecimal  encoding 


|  Base64 

|  Hexadecimal  | 

|  Alt.  Base64  | 

|Alt.  Hexadecimalj 

Actual 

Positive 

Actual 

Negative 

Actual 

Positive 

Actual 

Negative 

Actual 

Positive 

Actual 

Negative 

Actual 

Positive 

Actual 

Negative 

Tested  Positive 

100% 

Wo 

100% 

Wo 

100% 

1.25% 

96.75% 

IT5% 

Tested  Negative 

Wo 

100% 

Wo 

100% 

Wo 

98.75% 

3.25% 

WTWo 

Table  1.  Results  with  respect  to  various  Base64  and  Hexadecimal  encodings 


schemes,  we  randomly  swapped  ten  of  the  standard  alphabet  with  alternate 
ones  from  a  pool  of  space  and  punctuation  characters.  Our  profiler  was  able 
to  distinguish  between  them  in  an  account-agnostic  way,  with  a  false  positive 
rate  of  0.0%  and  false  negative  rate  of  1.25%  for  Alternate  Base64  encodings, 
and  a  false  positive  rate  of  3.25%  and  false  negative  rate  of  12.5%  for  Alternate 
Hexadecimal  encodings  (see  Table  1). 

The  classifier’s  accuracy  dropped  significantly  when  the  commands  were  ob¬ 
fuscated  with  other  words,  especially  with  Hexadecimal  encoding  schemes  and 
with  spaces  as  alternate  characters.  We  note  that  with  such  an  encoding  mecha¬ 
nism,  a  priori  knowledge  of  words  or  characters  to  excise  from  the  message  would 
be  necessary  to  extract  the  non-command  content  from  the  meaningful  botnet 
commands.  This  form  of  steganography  is  essentially  indistinguishable  from  typ¬ 
ical  steganography,  where  a  botmaster  would  hide  the  bot  commands  in  such  a 
way  as  to  not  attract  attention  to  themselves,  i.e. ,  using  natural  language  words 
as  code  for  commands  or  URLs. 

Limitations.  Hiding  commands  in  a  social  network-based  C&C  using  steganog¬ 
raphy  makes  it  difficult  for  programs  or  even  humans  to  identify  the  presence 
of  a  command  within  a  message.  Since  social  network  messages  left  by  users 


are  unstructured  content,  a  crafty  adversary  can  hide  a  bot  command  within 
a  message  in  such  a  way  that  a  human  reading  the  message  could  not  identify 
the  message  as  a  command.  Combined  with  encryption,  reverse  analysis — even 
with  a  captured  bot — may  not  yield  the  interpreted  commands.  Thus,  running  a 
steganographic  reversal  algorithm  on  a  C&C  message  would  not  return  data  that 
was  a  clear  bot  command.  [30].  However,  our  server-side  solution  coupled  with 
a  client-side  counterpart  that  detects  when  a  process  is  acting  on  input  from  a 
social  network-based  C&C  would  provide  a  complete  solution  to  this  emerging 
threat.  In  Section  7  we  will  discuss  how  these  limitations  may  be  overcome  in  a 
bigger  solution  framework. 

5.3  Performance 

Evaluation  methodology  and  environment  setup.  In  order  to  demonstrate 
the  efficiency  of  our  server-side  detection  mechanism,  we  measured  the  perfor¬ 
mance  of  our  prototype.  We  implemented  it  into  CompactSocial,  a  microblogging 
service  that  emulates  the  constraints  of  Twitter.com.  CompactSocial  provides  a 
simple  interface  to  both  update  a  status  message  and  view  any  account’s  mes¬ 
sages  using  a  web  browser.  Moreover,  an  auto-updated  RSS  feed  contains  the 
text  of  the  last  ten  account  updates.  CompactSocial  was  written  in  Java  6,  up¬ 
date  11  and  ran  as  a  web  application  deployed  to  Apache  Tomcat  6.0.20.  When 
used  as  a  library,  our  server-end  solution  was  deployed  as  a  .jar  file;  when  de¬ 
ployed  as  a  service,  the  classifier  ran  as  a  stand-alone  web  application  deployed 
to  Apache  Tomcat  6.0.20.  The  classifier  and  CompactSocial  used  shared  crypto¬ 
graphic  keys  for  authentication.  For  processing  incoming  and  outgoing  messages, 
Javas  crypto  library  was  used  to  compute  any  hash  value  or  HMAC  it  needed, 
in  both  cases  using  the  SHAl  algorithm. 

The  system  environment  is  depicted  in  Figure  3.  Both  servers  reside  within 
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Fig.  3.  Integrating  the  server-side  solution  into  real  life  systems 


a  university  campus  network,  and  the  CompactSocial  clients  are  both  within  and 
outside  the  campus  network.  The  CompactSocial  and  text-classifier  servers  are 
called  mercury  and  apollo,  respectively.  There  are  two  CompactSocial  client 
machines:  minerva  acted  as  an  external  computer  within  the  LAN  with  autho¬ 
rized  access  to  mercury  through  a  simple  CompactSocial  client,  and  mars  was  an 
adversary  client  machine  within  the  campus  network,  employing  Naz-I-  to  read 
updates  made  by  minerva.  A  fifth  machine,  venus  mimicked  the  minerva’s  func¬ 
tionality  and  tested  the  performance  of  the  classifier  on  a  non-dedicated  internet 


connection.  The  three  servers,  hermes,  jupiter,  and  euclid  recognized  each  other 
by  sharing  some  pair-wise  keys.  Table  2  reviews  the  concrete  configurations  of 
the  machines  and  networks. 


Machine 

Internet  Connection 

Relevant  Software 

mercury 

Gigabit  LAN 

Apache  Tomcat  6.0.20,  Sun  JVM,  CompactSocial 

apollo 

Gigabit  LAN 

Apache  Tomcat  6.0.20,  Sun  JVM,  classifier  service 

minerva 

Gigabit  LAN 

Firefox  3.5,  CompactSocial  client 

mars 

Gigabit  LAN 

.NET  3.5  Framework,  Naz+ 

venus 

100  Megabit  Cable 

Firefox  3.5,  CompactSocial  client 

Table  2.  System  settings  (all  machines  use  Intel  Core  2  Duo,  2.93  GHz  processor) 


A  CompactSocial  client  was  developed  in  JavaScript  to  simulate  a  user  up¬ 
dating  their  status  in  the  CompactSocial  web  application.  The  CompactSocial 
client  was  developed  as  an  addon  and  installed  into  Mozilla  Firefox.  Addition¬ 
ally,  Naz+  was  developed  as  a  Windows  service  and  written  in  C#,  targeting 
the  Microsoft  .NET  Framework,  version  3.5.  Naz+  periodically  checked  the  RSS 
feed  for  CompactSocial  test  account,  parsed  the  XML  for  the  description  node 
which  contained  the  bot  command  as  encrypted  text  (using  a  shared  key  with 
mars  who  updated  the  status  message),  and  executed  the  command. 
Performance  benchmarking.  To  examine  the  delay  incurred  by  the  classifier 
when  utilized  by  CompactSocial,  time  was  marked  before  and  after  each  trans¬ 
action  over  30000  requests  at  varying  rates.  We  repeated  this  test  ten  times  and 
took  the  average  over  the  runs.  Figure  4  shows  the  results  over  varying  requests 
per  second  when  the  classifier  was  used  as  a  library  and  a  service. 


Library  Performance  Analysis  Service  Performance  Analysis 


Fig.  4.  Classifier  library  (left)  and  service  (right)  performance  analysis 


When  used  as  a  library,  our  classifier  performed  roughly  twice  as  fast  than  its 
service  counterpart,  since  no  network  traffic  was  required.  At  even  500  requests 


per  second,  the  classifier  handled  all  requests  without  incident.  In  practice,  a 
large-scale  enterprise  would  use  our  server-side  classifier  as  a  service,  whether 
as  an  in-house  server  to  make  requests  or  a  pay-as-you-go  service  to  a  third 
party.  In  our  preliminary  testing,  the  classifier  service  handles  500  requests  per 
second  on  a  non-dedicated  machine  with  cycles  to  spare.  In  the  advent  that  a 
larger  throughput  was  necessary,  a  load  balancer  can  reduce  the  requests  for  a 
particular  machine  to  a  manageable  level. 

Meeting  the  needs  of  real-life  systems.  For  a  social  network  provider  like 
Twitter.com,  that  has  fifteen  million  users,  and  is  increasing  at  roughly  one  mil¬ 
lion  users  a  month,  we  wondered  how  our  classifier  would  stack  up.  A  large  per¬ 
centage  (85.3%)  of  Twitter.com  accounts  post  less  than  once  a  day,  whether  due 
to  lost  interest  or  not  having  much  to  post.  We  classify  these  accounts  as  “passive 
users”.  The  vast  majority  (99.5%)  of  accounts  post  less  than  sixteen  messages 
daily,  which  would  be  a  class  of  “active”  users,  who  post  regularly  about  events. 
The  remaining  0.5%  post  more  than  sixteen  updates,  although  Twitter.com  re¬ 
stricts  accounts  to  a  1000  updates  per  day  limit.  These  are  particularly  engaged 
users  or  are  shared  accounts  by  multiple  people  in  an  organization  posting  under 
a  common  account.  We  classify  these  users  as  “explosive”  users  [5]. 

If  all  accounts  hit  their  ceilings  on  Twitter.com,  we’d  have  the  posting  rates 
in  Table  3.  With  fifteen  million  users,  Twitter.com  would  handle  a  ceiling  of 
1410.7  messages  per  second  (in  actuality,  most  users  do  not  hit  these  ceilings, 
so  the  actual  threshold  is  far  less).  In  this  worst  case,  assuming  the  one  million 
account  per  month  growth  rate  and  the  same  distribution  of  account  usage, 
Twitter.com  will  accommodate  an  uptick  per  month  of  nearly  100  messages  per 
second. 


15000000  users 

Percentage 

Update  Rate 

Messages  Per  Day 

Messages  Per  Second 

Passive  users 

85.3% 

1/ day 

12795000 

148.1 

Active  users 

r  14.2% 

16/day 

34080000 

394.5 

Explosive  users 

oJ% 

1000/day 

75000000 

868.1 

Total  users 

100%" 

— 

121875000 

1410.7 

Table  3.  Twitter.com  usage  analysis 


Because  our  server-side  approach  is  account  agnostic,  it  does  not  need  to  build 
an  account  history  for  each  user  and  as  a  result,  would  only  need  to  check  a  few 
messages  to  determine  if  the  account  is  being  used  in  a  suspicious  way.  Given 
the  above  scenario  and  a  policy  that  checks  periodically  verifies  one  message 
daily  and  the  first  three  messages  for  a  new  user’s  account,  then  our  classifier 
would  only  need  to  check  fifteen  million  messages  per  day,  or  173.7  messages 
per  second,  with  a  monthly  increase  of  12.1  messages  per  second.  If  only  active 
and  explosive  accounts  were  targeted  (which  would  be  more  likely  behavior  for  a 
botnet  C&C),  this  would  decrease  to  25.6  messages  per  second  with  an  increase 
of  2.1  messages  per  second.  Thus,  even  with  these  simple  non-discriminating 
policies  and  worst-case  Twitter.com  usage  scenario,  our  classifier  as  a  service  can 


handle  an  enterprise-level  throughput  of  requests,  and  different  policy  strategies 
may  be  employed  to  throttle  down  the  throughput  further. 


6  Client-side  Countermeasures 

6.1  The  Detection  Mechanism 

Detection  attributes.  We  propose  detecting  social  network-based  botnet  C&Cs 
using  three  attributes:  self-concealing,  dubious  network  traffic,  and  unreliable 
provenance. 

—  Self-Concealing:  A  self-concealing  process  is  one  which  attempts  to  avoid 
detection  with  the  use  of  stealth  mechanisms.  We  consider  two  specific  in¬ 
stances  of  this  type: 

Graphical  User  Interface  Many  applications  that  read  RSS  feeds  interact  with 
a  user  via  a  graphical  user  interface.  Bots  and  other  malware  will  attempt 
to  avoid  detection,  and  as  a  result  may  run  in  the  background  as  a  service 
or  hidden  process  without  an  explicit  interface.  A  process  without  a 
graphical  user  interface  can  be  identified  as  possibly  self-concealing. 
Human  Computer  Interaction  Most  benign  software  works  by  reacting  to  user 
input  via  a  keyboard  or  mouse.  Malware  processes  tend  to  run  hidden 
and  independent  of  user  input  and  don’t  require  explicit  keyboard  or 
mouse  events  provided  by  a  user  to  perform  a  nefarious  act.  A  process 
without  human/computer  interaction  can  be  identified  as  possibly  self- 
concealing. 

—  Dubious  Network  Traffic:  A  process  with  dubious  network  traffic  is  one 
which  engages  in  network  communication  with  another  machine  in  a  covert 
or  devious  way.  We  consider  three  specific  instances  of  this  type: 

Social  Network  Request  Exclusively  visiting  social  networking  sites  is  not  sus¬ 
picious;  however,  social  network-based  botnet  C&C  craftily  abuse  the 
popularity  and  good  name  of  social  networking  sites;  thus,  exclusive 
requests  to  social  networking  or  web  2.0  sites  is  considered  a  possible 
trigger  event  for  dubious  network  traffic. 

Encoded  Text  Processing  Since  social  network-based  bots  read  commands  as 
encoded  text,  processes  making  connections  to  sources  that  provide  en¬ 
coded  or  encrypted  text  is  anomalous.  Accepting  connections  with  en¬ 
coded  text  and  processing  it  by  decoding  or  decrypting  it  can  be  consid¬ 
ered  as  possibly  dubious  network  traffic. 

Suspicious  File  Downloading  In  general,  applications  do  not  download  suspi¬ 
cious  files  such  as  executable,  library,  compressed,  or  encoded/encrypted 
files  without  permission  (though  they  may  download  image  or  text  files). 
Social  network-based  C&C  bot  processes,  on  the  other  hand,  act  as  Tro¬ 
jan  downloaders  and  almost  exclusively  save  executables  or  DLLs  to  the 
filesystem  as  malicious  payload.  Downloading  such  suspicious  files  can 
be  considered  as  dubious  network  traffic. 


—  Unreliable  Provenance:  A  process  with  unreliable  provenance  is  one 
which  lacks  a  reliable  origin.  We  consider  three  specific  instances: 

Self- Reference  Replication  This  is  a  feature  malware  uses  to  survive  disinfec¬ 
tion  on  a  host  machine,  occurring  when  a  process  copies  itself  into  a 
newly  created  file  or  an  existent  file  (by  modifying  it)  on  the  file  system 
[24],  An  installed  file  with  its  installer  not  having  a  verified  signature 
can  be  identified  as  possibly  having  an  unreliable  provenance. 

Dynamic  Code  Injection  This  is  used  by  malware  to  insert  malicious  code 
into  the  memory  space  of  an  active  process.  Its  end  goal  is  to  modify  the 
process  to  perform  nefarious  deeds,  possibly  by  piggybacking  on  that  ap¬ 
plication’s  authorization  settings.  A  process  whose  injector  lacks  a  digital 
signature  can  be  identified  as  possibly  having  an  unreliable  provenance 
since  the  injector’s  origin  cannot  be  established. 

Verifiable  Digital  Signature  Digital  signatures  may  be  considered  a  hallmark 
of  trust  between  users  and  well  established  software.  Most  organizations 
that  publish  software  provide  a  signature  for  their  program  and  related 
files.  Malware  authors  typically  do  not  employ  digital  signatures;  as  a 
consequence,  a  process  running  without  a  verifiable  digital  signature  can 
be  identified  as  possibly  having  an  unreliable  provenance. 

Detection  model.  We  say  a  process  P  has  the  self-concealing  attribute  ( Psc ) 
if  it  lacks  a  graphical  user  interface  {Pgui  =  false)  and  does  not  accept  human 
computer  interaction  ( Phd  =  false).  More  formally, 

{-'Pgui)  A  (-'Phci)  —>  Psc- 

We  say  a  process  P  has  the  dubious  network  traffic  attribute  ( Pdnt )  if  it  performs 
social  network  requests  ( Psnr  =  true)  and  encoded  text  processing  ( Petp  =  true) 
or  does  suspicious  file  downloading  {Psfd  =  true)  (or  both).  More  formally, 

Psnr  A  {Petp  V  Psfd)  *  Pdnt- 

We  say  a  process  P  has  the  unreliable  provenance  attribute  {Pup)  if  it  performs 
self-reference  replication  ( Psrr  =  true)  or  does  dynamic  code  injection  {Pdd  = 
true),  and  also  lacks  a  verified  digital  signature  {Pvds  =  false).  More  formally, 

{Psrr  V  Pdci)  A  (“ 1 Pvds )  *  Pup- 

Putting  the  above  altogether,  we  classify  a  process  P  as  being  suspicious  of  being 
a  social  network-based  bot  C&C  process  {Psnbb)  if  it  is  either  self-concealing 
{Psc  =  true)  or  has  an  unreliable  provenance  {Pup  =  true)  (or  both),  and  engages 
in  dubious  network  traffic  {Pdnt  =  true).  More  formally, 

( Psc  V  Pup)  A  Pdnt  *  Psnbb  • 


6.2  Effectiveness  and  Limitations 

Evaluation  methodology  and  environment  setup.  To  examine  the  effec¬ 
tiveness  of  the  detection  model  described  above,  we  collected  data  with  respect 


to  our  detection  attributes  for  both  benign  and  malicious  processes  in  order  to 
distinguish  them.  For  this  purpose,  we  considered  eighteen  benign  applications, 
four  traditional  bots,  and  the  malicious  Naz  and  prototype  Naz+  bots  (listed 
in  Table  4).  To  provide  a  wide  breadth,  the  benign  applications  are  a  broad 
selection  of  the  most  popular  web  browsers,  RSS  aggregators,  Twitter  clients, 
and  RSS  aggregators  which  read  subscription  feeds.  Testing  was  performed  using 


Type 

Type 

Type 

AOL  Explorer 

Web  Browser 

Internet  Explorer 

Web  Browser 

RSS  Bandit 

RSS  Aggregator 

Avant 

Web  Browser 

K-Meleon 

Web  Browser 

RSS  Owl 

RSS  Aggregator 

Bob  ax 

Traditional  Bot 

Maxthon 

Web  Browser 

SeaMonkey 

Web  Browser 

BlogBridge 

RSS  Aggregator 

Mercury 

RSS  Aggregator 

Snarfer 

RSS  Aggregator 

FeedDemon 

RSS  Aggregator 

Naz 

SN-Based  Bot 

Tweetdeck 

Twitter  Client 

FireFox 

Web  Browser 

Naz+ 

SN-Based  Bot 

Twhirl 

Twitter  Client 

Flock 

Web  Browser 

Opera 

Web  Browser 

Virut 

Traditional  Bot 

Google  Chrome 

Web  Browser 

Ozdok 

Traditional  Bot 

Waledac 

Traditional  Bot 

Table  4.  Client-side  test  set  (“SN-Based  Bot”  stands  for  “Social  Network-Based  Bot”) 


VMWare  Workstation  running  Microsoft  Windows  SP3  using  NAT  for  Internet 
access.  Each  application  was  executed  separately  for  a  period  of  four  hours,  fol¬ 
lowed  by  post-analysis.  During  testing  of  the  eighteen  benign  applications,  we 
interacted  with  each  application  by  subscribing  to  and  viewing  different  RSS 
feeds;  attempting  to  subscribe  to  bogus  RSS  feeds,  updating  all  RSS  feeds  ev¬ 
ery  hour,  reading  individual  feed  articles.  These  tests  were  done  to  provide  a 
wide  range  of  expected  and  unexpected  scenarios  for  each  application  to  deal 
with  while  recording  their  behavior.  In  addition,  we  used  a  number  of  sensors  to 
gather  information  about  each  process.  Tracing  of  the  three  detection  attributes 
described  in  Section  6.1  occurred  from  the  moment  a  process  starts  executing. 

Network  traffic  was  collected  using  Windows  Network  Monitor  [9] .  Keyboard 
and  mouse  input  was  collected  with  a  modified  version  of  GlobalHook.  Digi¬ 
tal  signatures  were  verified  using  SigCheck.  Self-reference  replication  and  dy¬ 
namic  code  injection  were  accomplished  with  kernel  hooks  implementing  known 
techniques  [24].  The  presence  of  a  user  interface  was  recorded  by  observing  the 
creation  of  any  window  upon  executing  each  application  using  EasyHook.  User 
input,  network  traffic,  graphical  user  interface  interaction,  self-reference  replica¬ 
tion  and  dynamic  code  injection  were  all  recorded  in  real-time.  For  Virut  and 
Waledac,  static  analysis  and  previous  executions  of  these  bots  yielded  their  du¬ 
bious  network  traffic  results.  Digital  signatures  were  verified  after  the  four  hour 
testing  of  each  process. 

Effectiveness.  The  results  are  summarized  in  Table  5  and  highlight  some  ob¬ 
servations  below.  First,  we  observe  that  all  benign  applications  lacked  the  self- 
concealing  attribute  as  they  all  utilized  a  graphical  user  interface  and  accepted 
inputs  from  the  user,  such  as  reading  an  article,  following  a  link,  or  updating 
an  RSS  feed.  All  bots  but  Virut  demonstrated  the  self-concealing  attribute  since 
they  did  not  have  a  graphical  user  interface  or  accept  user  input,  although  it 
appears  that  the  command  prompt  window  Virut  displays  may  be  accidental. 


Application 

|  Self-Concealing 

|  Unreliable  Provenance 

Dubious  Network  Traffic 

Result 

Graphical 

User 

Interface 

Human 

Computer 

Interaction 

Self- 

Reference 

Replication 

Dynamic 

Code 

Injection 

Verifiable 

Digital 

Signature 

Social 

Network 

Request 

Encoded 

Text 

Processing 

Suspicious 

File 

Download 

Social 
Network- 
Based  Bot? 

AOL  Explorer 

Y 

Y 

N 

N 

Y 

N 

N 

N 

N 

Avant 

Y 

Y 

N 

N 

Y 

N 

N 

N 

N 

BlogBridge 

Y 

Y 

N 

N 

Y 

N 

N 

N 

N 

FeedReader 
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Y 

N 

N 

Y 

N 

N 

N 

N 

Firefox 

Y 

Y 

N 

N 

Y 

N 

N 

N 

N 

Flock 
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Y 

N 

N 

Y 

N 

N 

N 

N 

IE 

Y 

Y 

N 

N 

Y 

N 

N 

N 

N 

Chrome 

Y 

Y 

N 

N 

Y 

N 

N 

N 

N 

K-Meleon 

Y 

Y 

N 

N 

Y 

N 

N 

N 

N 

Maxthon 

Y 

Y 

N 

N 

Y 

N 

N 

N 

N 

Mercury 

Y 

Y 

N 

N 

Y 

N 

N 

N 

N 

Opera 

Y 

Y 

N 

N 

Y 

N 

N 

N 

N 

RSS  Bandit 

Y 

Y 

N 

N 

Y 

N 

N 

N 

N 

RSS  Owl 

Y 

Y 

N 

N 

Y 

N 

N 

N 

N 

SeaMonkey 

Y 

Y 

N 

N 

Y 

N 

N 

N 

N 

Snarfer 

V 

Y 

N 

N 

Y 

N 

N 

N 

N 

Tweetdeck 

Y 

Y 

N 

N 

N 

Y 

N 

N 

N 

Twhirl 

Y 

Y 

N 

N 

N 

Y 

N 

N 

N 

Bobax 

N 

N 

Y 

Y 

N 

N 

Y 

Y 

N 

Ozdok 

N 

N 

Y 

Y 

N 

N 

Y 

Y 

N 

Virut 

Y 

N 

Y 

Y 

N 

N 

N 

Y 

N 

Waledac 

N 

N 

Y 

Y 

N 

N 

Y 

Y 

N 

Naz 

N 

N 

Y 

Y 

N 

Y 

Y 

Y 

Y 

Naz+ 

N 

N 

N 

N 

N 

Y 

Y 

Y 

Y 

Table  5.  Client-Side  detection  results  (“IE”  stands  for  “Internet  Explorer”) 


Second,  all  applications  but  Naz  and  Naz+  possessed  the  dubious  network  traffic 
attribute;  RSS  applications  did  not  download  suspicious  files,  but  all  of  the  bots 
did.  All  bots  but  Virut  read  inordinate  amounts  of  encoded  text;  legitimate  RSS 
reader  applications  generally  did  not.  Additionally,  only  Naz  and  Naz-I-  com¬ 
municated  nearly  exclusively  with  social  networking  sites  (benign  processes  read 
from  them  only  a  fraction  of  the  time,  and  traditional  bots  made  no  such  commu¬ 
nication  requests).  Third,  all  benign  applications  did  not  exhibit  the  unreliable 
provenance  attribute;  none  attempted  to  replicate  itself  or  inject  code  into  an¬ 
other  process,  and  they  all  possessed  a  verifiable  digital  signature.  On  the  other 
hand,  all  bots  tested  displayed  this  attribute,  as  they  all  copied  themselves  or 
injected  code  into  other  processes  (except  Naz+),  and  lacked  a  verifiable  digital 
signature. 

Of  note  is  that  no  social  network-based  bot  was  misclassified  as  a  benign 
process  and  no  benign  application  or  traditional  bot  was  misclassified  as  a  so¬ 
cial  network-based  bot.  We  reiterate  that  our  goal  is  to  detect  social  network- 
based  botnet  C&C,  which  explains  why  the  other  bots  Bobax,  Ozdok,  Virut  and 
Waledac  were  not  classified  as  social  network-based  bots.  These  bots  may  be 
detected  by  using  complementary  host-centric  or  network-centric  approaches,  or 
by  applying  the  relaxed  dubious  network  traffic  attribute  described  above.  This 
also  justifies  why  the  countermeasures  presented  in  the  paper  need  be  integrated 
into  a  comprehensive  defense  framework. 

Limitations.  A  limitation  of  our  effectiveness  analysis  is  the  lack  of  real-time 
analysis  of  other  social  network-based  botnet  C&C,  due  to  other  botnets  undis¬ 
covered  in  the  wild.  Moreover,  our  analysis  is  conducted  in  post-analysis.  We 
plan  to  develop  an  implementation  of  this  technique  to  provide  real-time  data 
gathering  and  evaluation.  Another  limitation  is  that  any  one  sensor  can  be  de¬ 
feated  if  the  malware  author  knows  the  concrete  details  of  its  implementation; 


knowledge  of  the  high-level  detection  model  is  not  enough.  For  example,  if  we 
only  seek  .  exe  and  compressed  files,  a  malicious  file  can  purposely  be  renamed 
to  .jpg  or  .html  which  would  bypass  our  file  download  sensor.  In  this  case,  the 
bot’s  internal  logic  will  have  extra  overhead,  possibly  checking  every  file  in  the 
network  traffic  with  these  file  extension  to  identify  the  malicious  one,  thus  mak¬ 
ing  this  approach  infeasible.  Malware  in  general  are  known  to  have  attributes 
that  trigger  many  of  these  sensors  [32]  and  thus  it  is  unlikely  that  a  bot  process 
will  effectively  work  and  bypass  all  our  sensors. 

6.3  Performance  Analysis 

In  order  to  measure  the  client-side  countermeasure’s  performance,  we  used  Pass- 
Mark’s  PerformanceTest  7.0  benchmarking  to  gather  information  on  its  CPU 
usage  [27].  We  benchmarked  the  baseline  case  with  no  data  collection  running, 
and  then  ran  the  data  collector  tracking  zero  to  five  processes  after  an  hour 
of  data  collection  had  passed.  Running  the  data  collector  added  a  4.8%  over- 


CPU  Benchmarking 


Fig.  5.  CPU  utilization  for  our  client-side  countermeasure  process  tracker 


head  to  the  overall  system,  and  running  it  to  track  one  to  five  processes  had 
between  a  13.3%  and  28.9%  overhead  (See  also  Figure  5).  With  optimization, 
these  numbers  could  be  lowered  further. 

7  Integrating  Server-  and  Client-side  Countermeasures 

Integrated  solution.  As  discussed  above,  the  server-side  and  client-side  coun¬ 
termeasures  are  integral  parts  of  our  client/server  solution  to  detecting  social 
network-based  botnet  C&C.  When  each  type  of  countermeasures  deployed  alone, 
they  can  be  defeated  in  certain  ways.  Integrating  our  server-side  classifier  into  a 
social  network  Webserver  is  straightforward  whether  it  operates  as  a  library  or 
service,  and  integrating  the  client-side  detection  algorithms  into  existing  mal¬ 
ware  detection  schemes  or  operating  its  sensors  as  its  own  detection  framework 


is  equally  as  uncomplicated.  Because  both  systems  are  stand-alone,  there  is  no 
need  to  have  the  systems  interoperate.  Indeed,  a  user  that  has  the  client-side  so¬ 
lution  installed  while  using  a  social  network  that  employs  the  server-side  solution 
gains  the  benefits  of  both. 

Limitations.  Even  when  our  classifier  is  utilized  by  a  social  network  provider 
and  a  machine  has  our  client  solution  installed,  using  both  still  has  some  lim¬ 
itations.  Specifically,  if  the  botmaster  employs  steganography  into  their  social 
network-based  C&C,  the  server-side  solution  in  its  current  form  will  not  detect 
that  message  being  passed.  Employing  steganography  in  such  a  way  will  diffuse 
the  content  in  the  message,  essentially  expanding  the  text  [4].  In  this  case,  if 
using  popular  social  networks  like  Twitter.com  or  FaceBook.com  with  character 
length  limitations,  spreading  a  command  over  multiple  messages  would  likely 
be  required.  More  specifically,  our  approach  can  benefit  from  host-centric  and 
network-centric  approaches  as  follows. 


A  bot  that  reads  steganographic  commands  and  can  evade  our 
client-side  sensors.  One  way  for  a  bot  to  evade  the  client-side  sensors  is  to 
exist  at  the  kernel  level.  Since  some  of  the  client-side  solution  sensors  exist  at 
the  user-level,  the  bot  can  effectively  bypass  enough  of  these  sensors  to  mask 
its  presence  on  the  machine.  Additionally,  a  bot  that  with  intimate  knowl¬ 
edge  of  the  implementation  details  of  the  client-side  sensors  can  maneuver 
around  our  countermeasures,  such  as  writing  code  that  falsifies  sensor  data. 
A  host-centric  approach  to  capture  additional  anomalous  information  at  the 
kernel  level  would  help  mitigate  this  attack. 

—  A  bot  that  reads  steganographic  commands  and  masquerades  as 
a  benign  process.  A  bot  that  behaves  as  a  benign  process  would  have  to 
lack  the  self-concealing  or  unreliable  provenance  attribute.  By  masquerading 
as  a  benign  application,  say  by  presenting  itself  as  a  graphical  application 
that  masks  its  true  purpose,  a  bot  could  exist  with  such  an  interface.  This 
bot  would  additionally  have  to  trick  the  user  into  starting  it  and  keeping 
it  running,  which  might  prove  difficult.  To  avoid  possessing  an  unreliable 
provenance,  this  bot  would  have  to  have  a  digital  signature,  which  is  diffi¬ 
cult  to  forge.  Additionally,  it  must  not  dynamically  inject  code  into  another 
source  or  replicate  itself,  which  are  hallmark  signs  of  bots,  since  they  wish 
to  inculcate  themselves  into  the  host  machine.  A  network-centric  solution 
is  necessary  to  analyze  network  layer  data  for  similar  events  occurring  from 
many  machines  in  a  network  during  a  small  timeframe. 

—  A  bot  that  reads  steganographic  commands  and  runs  scripts.  A  bot 
that  behaves  as  a  social  network-based  bot  but  downloads  text  files  instead 
of  executables  will  not  be  classified  as  a  social  network-based  bot,  although  it 
would  be  marked  as  a  possibly  suspicious  bot.  If  the  bot  contains  or  is  aware 
of  a  scripting  engine  such  as  a  Python  interpreter,  the  bot  can  run  the  script 
instead  of  an  executable.  A  host-centric  approach  to  contain  general  purpose 
malware  or  prevent  or  alert  the  user  of  script /program  execution  would  help 
stop  this  attack.  Additionally,  a  network-centric  strategy  to  detect  script 


file  downloads  would  help  prevent  the  scripts  from  being  downloaded  to  the 
client  machine. 


8  Conclusion  and  Future  Work 

We  systematically  studied  a  social  network-based  botnet  and  its  C&C  and  dis¬ 
cussed  their  future  evolutions.  We  investigated,  prototyped,  and  analyzed  both 
server-side  and  client-side  countermeasures,  which  are  integral  parts  of  a  solution 
to  the  emerging  threat  of  social  network-based  botnets.  We  also  discussed  how 
our  solution  can  benefit  from  host-centric  and  network-centric  botnet  detection 
solutions  so  as  to  formulate  a  comprehensive  defense  against  botnets. 

Our  future  work  includes:  (1)  implementing  the  client-side  countermeasures 
as  real-time  detection  systems,  (2)  improving  the  server-side  classifier  to  detect 
steganography,  (3)  handling  multiple  stepping  stones  in  payload  redirection,  and 
(4)  and  porting  the  client-side  countermeasures  to  other  platforms. 
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Abstract.  In  this  paper  we  address  the  following  questions:  From  a  networking 
perspective,  do  malicious  programs  (malware,  bots,  viruses,  etc...)  behave  dif¬ 
ferently  from  benign  programs  that  run  daily  for  various  needs?  If  so,  how  may 
we  exploit  the  differences  in  network  behavior  to  detect  them?  To  address  these 
questions,  we  are  systematically  analyzing  the  behavior  of  a  large  set  (at  the  mag¬ 
nitude  of  2,000)  of  malware  samples.  We  present  our  initial  results  after  analyzing 
1000  malware  samples.  The  results  show  that  malicious  and  benign  programs  be¬ 
have  quite  differently  from  a  network  perspective.  We  are  still  in  the  process  of 
attempting  to  interpret  the  differences,  which  nevertheless  have  been  utilized  to 
detect  31  malware  samples  which  were  not  detected  by  any  antivirus  software 
on  Virustotal.com  as  of  01  April  2010,  giving  evidence  that  the  differences  be¬ 
tween  malicious  and  benign  network  behavior  has  a  possible  use  in  helping  stop 
zero-day  attacks  on  a  host  machine. 


1  Introduction 

The  ever  growing  sophistication  of  malware,  especially  zero-day  attacks,  with  faster 
distribution  and  stealthier  execution  has  forced  signature  based  detection  in  an  uphill 
battle  that  is  difficult  to  win.  Behavior  based  detection  is  increasingly  being  used  by 
commercial  software  vendors  with  some  success  but  is  partially  reliant  on  understand¬ 
ing  the  behavior  of  known  malware  to  attempt  detecting  future  attacks. 

This  research  analyzes  known  malicious  and  benign  samples  in  an  attempt  to  exploit 
differences  in  their  network  behavior  to  accomplish  accurate  behavior  based  malware 
detection.  The  data  set  consisted  of  1000  malware  samples,  including  31  not  detected 
by  any  antivirus  software  on  Virustotal.com  on  01  April  2010  and  123  benign  samples. 
The  analyzed  data  included  DNS,  NetBIOS,  TCR  UDP,  ICMP  and  other  network  traf¬ 
fic.  For  each  analyzed  malware  and  benign  sample,  we  collected  occurrence  amounts 
of  basic  network  functions  such  as  total  number  of  DNS  queries  and  NetBIOS  query 
requests.  Observations  of  captured  network  activity  and  occurrence  amounts  were  an¬ 
alyzed  and  correlated  to  identify  network  behaviors  occurring  mostly  in  malware.  We 
use  clustering  and  classification  algorithms  to  evaluate  how  effectively  our  observed 
network  behaviors  can  differentiate  malware  from  benign  samples. 


Given  our  observed  network  behaviors,  our  clustering  and  classification  produced 
minimal  false  positives  and  false  negatives.  In  addition,  31  malware  samples  not  identi¬ 
fied  by  any  antivirus  software  on  Virustotal.com  on  01  April  2010  were  correctly  clus¬ 
tered  and  classified  using  our  observed  network  behaviors.  These  results  give  evidence 
that  the  observed  differences  between  malicious  and  benign  network  behavior  can  be 
useful  in  stopping  zero-day  attacks  on  a  host  machine. 

The  principal  contributions  of  this  research  are: 

1.  Identification  of  network  behaviors  occurring  mostly  in  malware  usable  in  behavior 
based  malware  detection. 

2.  Discovery  of  novel  malicious  uses  of  network  services  by  malware. 

3.  Evaluating  the  effectiveness  of  observed  network  behaviors  in  identifying  malware 
and  benign  processes  with  clustering  and  classification. 

This  research  presents  early  results  of  one  perspective  of  an  ongoing  project  deal¬ 
ing  with  malware  behavior  based  on  a  sample  size  of  1000  malware  and  41  benign 
processes.  The  benign  processes  were  executed  three  times  each  for  a  total  of  123  in¬ 
stances  which  were  used  as  samples  for  our  analysis.  The  goal  of  this  ongoing  research 
is  a  real  time  behavior  based  malware  detection  system  incorporating  several  perspec¬ 
tives  capable  of  detecting  known  and  unknown  malware  on  host  machines. 

The  rest  of  this  paper  is  organized  as  follows:  Section  2  gives  related  work.  Section 
3  describes  our  data  set.  Section  4  presents  our  network  behaviors.  Section  5  gives 
our  clustering  and  classification  results.  Section  6  discusses  our  approach  and  results, 
Section  7  gives  limitations  and  Section  8  is  conclusions  and  future  work. 

2  Related  Work 

The  research  of  Bayer  et.  al.  [3]  presents  a  dynamic  malware  analysis  platform  called 
Anubis  which  is  used  to  collect  behaviors  of  known  malware  samples  in  a  controlled  en¬ 
vironment.  This  system  inputs  a  binary  executable  and  records  API  invocation,  network 
activity,  and  data  flows.  The  results  are  used  to  report  observed  behaviors  principally 
on  the  file  system,  registry,  and  network  activity.  The  reported  network  activity  only 
provides  data  on  usage  of  protocol  traffic,  connections  to  remote  servers,  file  down¬ 
loads,  port  scanning  and  other  typical  network  tasks.  The  results  give  direction  as  to 
which  forms  of  network  activity  should  be  monitored  closely  for  malicious  events.  Us¬ 
ing  Bayer  et.  al.  as  motivation,  we  analyzed  and  produced  occurrence  amounts  of  basic 
network  functions  which  were  used  to  aid  in  defining  our  set  of  network  behaviors. 

Malware  binary  analysis  platforms  such  as  Anubis  [1],  Malheur  [13],  Bitblaze  [4] 
and  CwSandbox  [24]  are  designed  primarily  to  run  known  malware  samples  in  a  con¬ 
trolled  environment  and  record  execution  behavior.  Recording  is  done  via  various  tech¬ 
niques  from  API  hooking  to  monitoring  modifications  and  data  flows  in  various  OS 
components.  These  platforms  record  general  network  activity  behavior  which  are  re¬ 
ported  to  the  user.  The  reports  do  not  include  sufficient  detailed  information  to  identify 
malware’s  precise  implementation  and  use  of  network  services  making  it  difficult  to  dis¬ 
cover  novel  malicious  acts  captured  in  network  activity.  Our  research  fills  this  gap  by 


capturing  finely  grained  network  behavior  facilitating  detailed  analysis  which  was  key 
in  our  discovery  of  novel  network  behaviors  that  successfully  detected  several  malware 
samples. 

The  research  presented  by  Morales  et.  al.  [15]  analyzes  a  specific  form  of  network 
activity  behavior  called  RD-behavior  which  is  based  on  a  combination  of  DNS  activity 
and  TCP  connection  attempts.  The  authors  found  bot  processes  often  use  reverse  DNS 
queries  (rDNS)  possibly  to  harvest  new  domain  names.  The  rDNS  often  fails  and  is  then 
followed  by  a  TCP  connection  attempt  to  the  input  IP  address  of  the  failed  rDNS,  the 
authors  regard  this  as  an  anomalous  behavior.  This  anomalous  behavior  is  successfully 
used  by  the  authors  to  detect  bots  and  non-bot  malware.  The  approach  in  [15]  was 
limited  when  the  authors  removed  one  of  their  defined  behavior:  a  failed  connection 
attempt  to  the  returned  IP  address  of  a  successful  DNS  query.  Our  results  revealed 
an  almost  total  absence  of  rDNS  usage  and  several  instances  where  malware  used  the 
removed  behavior.  Using  this  behavior  helped  raise  our  detection  accuracy. 

The  research  of  Zhu  et.  al.  [28]  detected  bots  with  a  host-based  technique  based 
on  high  number  of  failed  connection  attempts.  Measuring  the  connection  failure  rate  of 
bots  and  benign  processes  showed  that  successful  bot  detection  is  achievable  using  only 
this  metric.  Measuring  failed  connection  attempts  may  only  be  effective  with  bots  that 
are  totally  or  partially  inactive  while  fully  active  up  to  date  bots  and  other  malware  with 
little  or  no  failed  connection  attempts  may  go  undetected  by  this  approach.  Our  research 
relates  failed  connection  attempts  with  DNS,  NetBIOS  and  other  network  behaviors 
creating  a  more  robust  approach  to  malware  analysis  and  detection. 

A  broad  corpus  of  research  exists  analyzing  and  detecting  malware  samples,  fami¬ 
lies  and  categories  [8, 17, 11,9,22,6, 14, 12, 16, 18,7,2,23].  All  use  different  perspec¬ 
tives  to  measure,  analyze  and  detect  malware  using  host-based,  network-based  and  hy¬ 
brid  approaches.  Our  research  enhances  the  current  literature  by  relating  different  spe¬ 
cific  network  activities  together  to  define  network  behaviors  mostly  used  by  malware. 

3  Data  Set  Analysis 

Our  analysis  is  based  on  1000  known  malware  samples  and  41  benign  samples.  The 
benign  samples  were  executed  three  times  each  for  a  total  of  123  instances  which  were 
used  as  samples  for  our  analysis.  The  malware  samples  were  acquired  by  downloading 
the  first  969  samples  from  the  CWSandbox  sample  feed  on  27  October  2009  [24].  The 
upload  date  was  arbitrarily  chosen.  The  set  contains  a  broad  range  of  malware  types 
including:  bots,  backdoors,  malware  downloaders,  keyloggers,  password  stealers  and 
spyware  amongst  others.  Table  1  lists  prominent  malware  in  the  data  set.  Uploading 
the  MD5  sums  to  Virustotal.com  provided  malware  names  from  Kaspersky,  McAfee 
and  Symantec.  We  also  downloaded  31  malware  samples  from  the  31  March  2010  up¬ 
load  on  CWSandbox  malware  repository.  These  31  were  chosen  because  their  MD5 
sums,  listed  in  Table  2,  were  reported  as  undetected  by  all  antivirus  software  used  by 
Virustotal.com  on  01  April  2010  and  we  were  capable  of  executing  and  capturing  their 
network  behavior  in  our  testing  environment.  The  majority  of  our  malware  samples  had 
successful  network  activity  during  the  collection  period  connecting  with  remote  hosts 
and  conducting  malicious  deeds. 


j  Prominent  malware  samples  In  data  set  j 

Downloaders 

Bots 

Worms 

Hybrids 

Bifrose.bmzp 

Koobface.d 

Iksmas.bqs 

Krap.n 

PcClient.ahqy 

Padobot.m 

Mydoom.m 

PolyCrypt.b 

Poison.pg 

Virut.by 

Allaple.a 

Refroso.qj 

Turkojan.il 

Zbot.acnd 

Bacteraloh.h 

Scar.hez 

Genome.cehu 

Buzus.amsz 

Palevo.ddm 

CodecPack.ill 

Lipler.fhm 

Adware 

Scareware 

Rootkits 

Viruses 

FenomenGame 

SystemSecurity.cc 

Tdss.f 

Sality.aa 

BHO.nby 

XpPoliceAV.apd 

Monderd.gen 

|  Benign  samples  in  data  set  j 

Adobe  Reader 

Ares 

Avant 

BitTorrent 

Chrome 

CuteFtp 

DeskTube 

Facebook 

Desktop 

FileZilla 

FireFox 

FlickRoom 

Flock 

Google  Talk 

Google  Update 

IE  explorer 

Kaspersky 

Security 

K-Meleon 

LimeWire 

Ping 

PPLive 

PPStream 

RSSBandit 

Skype 

Snarfer 

Snitter 

SopCast 

Spyware  Dr. 

Stream  Torrent 

Streamer  radio 

TortoiseSVN 

Traceroute 

TVants 

Tvkoo 

TVUPlayer 

TweetDeck 

Twhirl 

uTorrent 

UUSee 

Win  Player 

Win  Update 

Zultrax 

Table  1.  Prominent  malware  and  benign  samples  in  data  set. 


The  benign  test  set,  also  listed  in  Table  1,  covered  a  wide  range  of  popular  and  daily 
used  network  active  applications  including:  web  browsers,  FTP  clients,  RSS  readers, 
social  network  clients,  antivirus  software,  Peer-to-Peer  (P2P)  clients  and  standard  net¬ 
work  tools  amongst  others.  We  captured  network  activity  in  VMWare  Workstation  with 
Windows  XP  SP2  using  Windows  Network  Monitor  along  with  proprietary  network 
layer  monitors  to  record  the  network  activity  for  an  execution  period  of  10  minutes  for 
each  data  set  sample.  The  individual  samples  were  manually  executed  one  at  a  time  in 
VMWare  Workstation  with  our  monitors  collecting  all  network  traffic  and  the  captured 
data  was  saved  to  a  local  repository  for  analysis.  The  benign  processes  were  installed, 
used  under  normal  conditions  and  updated  (when  available)  during  testing. 

The  network  activity  of  the  group  of  969  malware  samples  was  collected  between  27 
October  2009  and  01  November  2009,  the  network  activity  of  the  group  of  31  malware 
samples  was  collected  on  01  April  2010,  and  the  network  activity  of  the  group  of  41 
benign  samples  was  collected  between  01  April  2010  and  03  April  2010.  Collecting 
network  behavior  of  the  malware  samples  was  done  immediately  after  downloading 


31  malware  not  detected  on  Virustotal.com  -  01  April  2010 

732e014e309ffab8ed9a05198d060a0b 

ce  1  cd3  809 1 0e28092f880643ec  1  f 809d 

940044 13 140e2022c0880f3828alc0ee 

cbed573de  1 8b900cd9 1  cc9e455  8fb645 

bcebf38 1  a36099f697d2e00f3ec4f26e 

7a84fd3ff0aa487ae2142e7 1 30c78d9f 

2fbea 1 82c4c7 d474 1 9b2c25d7  2eb64bc 

6d25e4a5db 1 30cda772e09d458afacad 

8a98 1 76d289e099ccf359aaed06daf9e 

bdd7bd56d65471b594c0822dd434a84f 

037629b54b5714457ff2abefdab0c349 

6b24b3779730f4add8d562daalbc0ddf 

7407c24fl7d7c582901623c410ab7a91 

8 1 89e6f967b612e5ee7a74981278de4a 

36a256686620fa7d3b9433afl9cf57a2 

5cfb57eac56c8639329d9ecab7b7f4ac 

cde  1 7b3c02d6 1 43a9c  1  fa22eedad65ac 

fbc377f70 1 0b6a32 1 6f7  fd330dcfe69e 

2e3108689a758c629286ef552e89b858 

Ob  15d6658f306cfea3fe20bd32c9 1  aOd 

ae7d5ad00 1  c26bbda2f326 10f28484b9 

9207e79elf2191d3d44343482ab58a4e 

25 1 8 Ic8ed97357b52ea775bc5dca353c 

2bbb004cc926a071bda327ca83bf03fb 

b0c895 19569ce2e3 10958af0e5932edl 

e73da6feae4fabd25  lbb  19f39c  1  a36d3 

d2ebbc7609672d46e7bb8b233af585aa 

e38c4a027b5a570eae8c57de8e26fcbb 

bc8aa3e072fbec4045bf94375ac53be9 

018197ab7020625864e6f4ff6561  lfc7 

5dae2c8bf87e6a9ad44e52d38ac34 1 1  e 

Table  2.  MD5  sums  of  data  set  malware  samples  not  detected  on  VirasTotal.com 


the  samples  to  assure  the  samples  were  still  active,  meaning  the  malware  would  still 
connect  with  remote  hosts  and  conduct  malicious  deeds  producing  network  traffic.  The 
vast  majority  of  our  malware  samples,  over  95%,  produced  network  traffic  which  was 
the  basis  of  our  analysis. 

4  Network  Behavior 

This  research  analyzes  known  malware  and  benign  samples  in  an  attempt  to  exploit 
differences  in  their  network  behavior  to  accomplish  accurate  behavior  based  malware 
detection.  Differences  in  network  behavior  were  identified  through  manual  post  analysis 
of  collected  network  traffic.  The  captured  network  activity  of  our  data  set  contained 
typical  protocols  such  as  TCP,  UDP,  and  DNS  but  they  were  not  always  used  in  the 
normal  expected  way,  most  notably  in  our  malware  samples.  We  were  able  to  collect 
occurrence  totals  of  basic  network  functions  and  correlate  together  different  occurrence 
amounts  of  specific  network  activity  to  identify  network  behaviors  which,  according 
to  our  results,  occurred  more  often  in  malware  than  benign  samples.  The  identified 
network  behaviors,  defined  as  Bn  where  n  is  an  identification  number,  are  described 
below. 

4.1  DNS  and  NetBIOS 

The  Domain  Name  System  (DNS)  and  Network  Basic  Input/Output  System  (NetBIOS) 
provide  services  to  acquire  IP  addresses  when  a  domain  name  is  provided  and  vice  versa 
[5, 19].  Coarse-grain  occurrence  amounts  of  both  protocols  by  known  malware  has  been 
previously  shown  [3, 15].  Table  3  summarizes  our  occurrence  amounts  for  DNS  queries, 


reverse  DNS  queries  and  NetBIOS  name  requests.  The  analysis  revealed  100%  of  be¬ 
nign  processes  and  77%  malware  issuing  DNS  queries  mostly  due  to  malware’s  use 
of  other  network  services,  such  as  NetBIOS  and  ICMP,  to  acquire  IP  addresses  for 
connection  attempts.  The  benign  samples  with  failed  DNS  queries  were  web  browsers 
unable  to  reach  third  party  content  and  P2P  video  and  audio  streamers  unable  to  locate 
remote  hosts  for  a  specific  stream.  Several  malware  samples  had  failed  DNS  queries, 
most  were  domain  names  of  malware  servers  that  were  either  not  active  or  previously 
discovered  and  shut  down.  Reverse  DNS  queries  (rDNS)  were  notably  absent  with  only 
2%  of  malware  and  no  benign  samples.  This  contradicts  the  findings  of  [15]  which 
documented  hots  and  non-bot  malware  performing  rDNS  and  conjectured  these  queries 
were  an  essential  component  to  establish  malicious  network  activity.  It  can  be  inferred, 
from  testing  our  samples,  that  the  current  generation  of  malware  may  possibly  be  less 
reliant  on  rDNS  in  favor  of  other  techniques  providing  the  same  IP  address  and  domain 
name  related  information. 

Analyzing  the  occurrence  totals  of  NetBIOS  name  requests  (NBTNS)  revealed  56% 
of  malware  and  4%  of  benign  samples  implemented  this  activity.  The  benign  sam¬ 
ples  with  NetBIOS  name  requests  were  the  web  browsers  Google  Chrome  with  fifteen 
name  requests  and  Firefox  with  six  name  requests.  Further  analysis  revealed  the  domain 
names  used  in  the  NetBIOS  name  request  of  Google  Chrome  and  Firefox  had  first  been 
used  in  a  DNS  query  with  some  failing  and  others  succeeding.  The  malware  samples 
revealed  two  distinct  forms  of  NetBIOS  name  request  usage:  (i)  expected  usage,  same 
as  benign,  and  (ii)  performing  NetBIOS  name  requests  on  domain  names  that  were 
not  part  of  a  captured  DNS  or  rDNS  query.  To  our  knowledge,  the  second  form  is  a 
novel  observation  of  NetBIOS  use  by  malware  not  presented  in  previous  research.  Of 
the  1000  malware  samples,  49%  exhibited  the  second  NetBIOS  usage  described  here. 
We  concluded  this  was  a  network  behavior  occurring  mostly  in  malware  and  usable  for 
detection.  Based  on  this,  we  define  the  following  network  behavior: 

—  B i :  A  process  performs  a  NetBIOS  name  request  on  a  domain  name  that  is  not  part 

of  a  DNS  or  rDNS  query. 

Table  3  shows  B\  occurring  only  in  malware,  with  49%.  Using  online  malware 
databases  such  as  MalwareURL.com,  we  found  many  domain  names  used  by  our  mal¬ 
ware  samples  in  B\  identified  as  malware  servers,  but  several  other  domains  did  not 
show  up  leading  us  to  believe  they  were  recently  created  and  registered,  inactive,  had 
avoided  detection,  were  infected  hosts,  or  newly  activated  servers.  We  conjecture  mal¬ 
ware  uses  behavior  B\  in  an  attempt  to  acquire  remote  host  information  while  avoiding 
detection  by  anti-malware  that  may  not  monitor  NetBIOS  but  most  probably  does  mon¬ 
itor  DNS. 

4.2  RD-behavior 

This  network  behavior  as  originally  defined  [15]  was  primarily  based  on  frequent  usage 
of  reverse  DNS  queries  (rDNS)  by  hots.  The  authors  defined  four  network  behavior 
paths  of  which  three  included  rDNS.  Their  results  implied  rDNS  combined  with  TCP 
connection  attempts  was  sufficient  to  detect  malware  and  eliminated  false  positives  by 


Samples 

with 

Malware 
1000  samples 

Benign 
123  samples 

DNS  queries 

77% 

100% 

Reverse  DNS 
queries 

2% 

0% 

NetBIOS 
name  requests 

56% 

4% 

Behavior  B 1 

49% 

0% 

Table  3.  Samples  with  DNS,  NetBIOS,  &  73 1 


omitting  the  only  behavior  path  dealing  solely  with  DNS  queries.  Our  analysis  revealed 
a  notable  absence  of  rDNS  and  a  high  occurrence  of  DNS  queries,  see  Table  3,  many  of 
which  exhibited  the  omitted  behavior.  We  conjecture  better  detection  can  be  achieved 
by  including  all  four  behaviors  from  [15]  redefined  as  follows: 

-  Ij2'-  Failed  connection  attempt  to  an  IP  address  obtained  from  a  successful  DNS 
query. 

-  B3:  Failed  connection  attempt  to  the  input  IP  address  of  a  successful  rDNS  query. 

-  /3  [ :  Connection  attempt  to  the  input  IP  address  of  a  failed  rDNS  query. 

In  [15]  behavior  path  P5  is  defined  as:  A  successful  connection  to  an  IP  address  used 
in  a  failed  rDNS  query  and  behavior  path  1\  is  defined  as:  A  failure  to  connect  with 
an  IP  address  used  in  a  failed  rDNS  query.  We  reduced  the  number  of  network  behav¬ 
iors  by  combining  behavior  paths  P5  and  l\,  into  one  network  behavior  B] .  Behavior 
f?2  implies  a  successful  connection  should  occur  to  IP  addresses  obtained  in  successful 
DNS  queries,  a  failed  connection  attempt  indicates  something  is  not  right  and  should 
be  investigated.  Malware  can  exhibit  this  behavior  when  domain  names  have  been  shut 
down  or  taken  offline  and  their  DNS  records  have  not  been  updated  or  removed.  Behav¬ 
ior  / 3  ’  has  the  same  implication  as  B>  but  with  the  input  IP  address  of  rDNS  queries. 
Behavior  B4  is  assumed  to  only  occur  in  malware.  We  assuem  an  input  IP  address 
failing  an  rDNS  query  as  unreachable  and  should  not  be  used  for  connection  attempts. 
Table  4  shows  total  number  of  processes  with  behaviors  B2,  B:i  and  B4.  Our  occurrence 
amounts  showed  21%  of  malware  and  no  benign  samples  with  II2  and  no  occurrences 
of  B3  and  /3  j  due  to  very  low  rDNS  usage.  These  results  imply  rDNS  may  be  used  less 
often  by  malware  in  favor  of  other  techniques  providing  the  same  information  in  a  more 
clandestine  manner. 

4.3  UDP  and  ICMP 

Traffic  between  local  and  remote  hosts  using  captured  User  Datagram  Protocol  (UDP) 
[25]  did  not  serve  a  significant  role,  except  for  DNS  and  rDNS,  in  our  analysis  due 
to  similar  occurrence  amounts  of  network  activity  in  both  malware  and  benign.  Previ¬ 
ous  research  [3]  has  documented  coarse-grain  UDP  occurrence  amounts  by  malware, 
but  does  not  include  a  comparison  with  benign  processes.  Identifying  network  activity 
behaviors  in  the  UDP  protocol  is  part  of  our  ongoing  research. 


Samples 

Malware 

Benign 

with 

1000  samples 

123  samples 

Behavior  B2 

21% 

0% 

Behavior  B3 

0% 

0% 

Behavior  B4 

0% 

0% 

Table  4.  Samples  with  behaviors  B2,  B3  &  B4 


The  occurrence  amounts  of  Internet  Control  Message  Protocol  (ICMP)  [10]  activ¬ 
ity,  which  focused  on  ICMP  echo  requests  and  replies,  revealed  an  elevated  usage  by 
the  malware  samples  in  comparison  to  the  benign  samples.  Further  analysis  concluded 
that  malware  was  using  ICMP  echo  requests  in  the  same  manner  as  the  Ping  network 
utility  [20]  to  decide  if  a  remote  host  was  reachable,  thus  being  a  candidate  for  a  con¬ 
nection  attempt.  Malware  use  of  ICMP  has  been  previously  observed  [27]  but  was  not 
distinguished  as  a  behavior  frequently  used  by  malware  in  comparison  to  benign.  Our 
analysis  showed  malware  never  attempted  connections  to  IP  addresses  not  receiving  a 
reply  to  an  ICMP  echo  request  and  almost  always  attempted  to  connect  with  IP  ad¬ 
dresses  that  did  have  a  successful  reply.  Furthermore,  the  input  IP  address  of  the  echo 
requests  were  never  part  of  a  DNS  or  rDNS  query  or  NetBIOS  name  request  leading 
to  conclude  these  IP  addresses  were  hardwired,  dynamically  generated,  or  downloaded 
from  a  malware  server.  Based  on  these  observations,  we  define  two  network  behaviors 
as  follows: 

-  B5:  ICMP  only  activity,  ICMP  echo  requests  for  a  specific  non-local  network  IP 
address  with  no  reply  or  a  returned  error  message. 

-  /if,:  TCP/ICMP  activity,  TCP  connection  attempts  to  non-local  IP  addresses  that 
received  a  successful  reply  to  their  ICMP  echo  requests. 

We  assume  the  IP  addresses  used  in  JB5  and  /if,  are  never  part  of  DNS,  rDNS  or 
NetBIOS  activity.  This  assumption  is  supported  by  our  observations  of  the  captured 
network  activity.  The  results  of  this  analysis  are  listed  in  Table  5.  B5  occurred  more  of¬ 
ten  in  benign  than  malware  but  the  benign  samples  also  used  ICMP  less  than  malware, 
perhaps  favoring  other  similar  and  more  conventional  services  such  as  DNS  queries, 
see  Table  3.  Bq  was  exhibited  in  11%  of  malware  and  only  2%  benign  samples.  This 
supports  our  claim  that  malware  frequents  ICMP  use  to  identify  IP  addresses  for  con¬ 
nection  attempts.  Our  observations  of  and  Bq  are,  to  our  knowledge,  novel  in  the 
literature  not  being  previously  reported. 

4.4  Other  network  activity 

This  encapsulates  other  less  occurring  activities  which  were  considered  significant  since 
they  rarely  occurred  in  any  of  our  data  set  samples  or  were  implemented  in  a  non- 
conventional  way.  We  consider  these  network  activities  to  be  anomalous  and  not  nec¬ 
essarily  malicious  behaviors.  The  value  of  recording  occurrences  of  these  behaviors  is 
in  cases  where  a  novel  and  never  before  observed,  or  rarely  used  malicious  behavior 
occurs  in  a  malware  sample.  We  encompass  this  idea  with  the  following  behavior: 


Samples 

with 

Malware 
1000  samples 

Benign 
123  samples 

Behavior  B5 

3% 

4% 

Behavior  B§ 

11% 

2% 

Table  5.  Samples  with  behaviors  B$  &  Bq 


Samples 

with 

Malware 
1000  samples 

Benign 
123  samples 

TCP  connection  attempts  to  IP  addresses 
never  used  in  DNS,  NetBIOS,  ICMP 

10% 

2% 

Listen  connections  on 
non-typical  port  numbers 

2% 

7% 

Successful  DNS  queries  returning 
local  network  IP  addresses 

1% 

0% 

Use  of  non- typical  network 
protocols  and  commands 

4% 

0% 

Behavior  B7 

18% 

9% 

Table  6.  Samples  with  behavior  Bj 


-  B 7:  Network  activity  that  is  rarely  occurring  or  implemented  in  an  anomalous  man¬ 
ner. 

Table  6  lists  the  amount  of  samples  exhibiting  the  different  types  of  observed  net¬ 
work  activity  and  H-.  TCP  connection  attempts  to  IP  addresses  which  were  not  part  of 
DNS,  NetBIOS  or  ICMP  activity  were  the  most  prominent  in  this  group  with  10%  in 
malware  and  only  2%  in  benign.  These  malware,  upon  initial  execution,  immediately 
attempted  connections  to  IP  addresses  ranging  from  a  few  to  over  one  hundred  differ¬ 
ent  addresses  which  appeared  to  have  been  hardwired  or  dynamically  generated.  The 
benign  sample  with  this  activity  was  the  video  chat  program  Skype  which  connected  to 
a  server  during  installation. 

Second  most  prevalent  network  activity  was  use  of  non-typical  protocols  and  net¬ 
work  commands  with  4%  in  malware  none  in  benign.  The  malware  attempted  connec¬ 
tions  using  either  FTP  or  SMB  or  RTCP  These  were  the  only  samples  from  our  data  set 
using  these  protocols  except  for  FTP  which  is  a  typical  protocol;  the  reason  we  docu¬ 
mented  FTP  usage  is  the  malware  had  a  very  small  amount  of  FTP  activity  download 
from  a  remote  server  along  with  a  much  lager  amount  of  TCP  and  UDP  traffic. 

One  malware  sample  used  the  authentication  system  KerebosV5  and  one  other  mal¬ 
ware  sample  used  the  network  command  suite  Andx.  Interestingly,  the  Andx  commands 
were  attempting  to  authenticate  and  access  local  network  IP  addresses  in  search  of  a 
file  server  perhaps  to  host  inappropriate  content.  Listening  TCP  connections  using  non¬ 
typical  port  numbers  occurred  in  2%  malware  and  7%  benign  samples.  Malware  listened 
on  non-typical  or  private  ports  [21]  such  as  port  numbers;  19178,  24450,  25254,  27145 
and  36975;  benign  also  listened  on  non-typical  or  private  ports  such  as  port  numbers; 


19396,  33680,  36363  and  58480.  Two  malware  samples  performed  successful  DNS 
queries  on  domain  names  returning  local  network  IP  addresses:  gogog029. 100webspace.net 
-  127.0.0.1  and  probooter2009.no-ip.org  -  0.0. 0.0.  It  is  unclear  if  these  DNS  query  re¬ 
sults  were  modified  by  the  malware  or  if  these  were  intentionally  returned  by  the  DNS 
server.  /j7  was  exhibited  in  18%  malware  and  9%  benign,  suggesting  rarely  or  anoma¬ 
lous  occuring  network  activity  may  be  useful  in  differentiating  malware  and  benign. 

5  Clustering  and  Classification 

To  evaluate  how  effectively  our  observed  network  behaviors  can  differentiate  between 
malicious  and  benign  samples,  we  input  the  data  through  clustering  and  classification 
algorithms  using  the  Weka  data  mining  software  [26].  Clustering  and  classification  al¬ 
gorithms  are  extensively  used  in  the  literature  to  evaluate  proposed  host,  network  and 
hybrid  detection  approaches  and  are  well  established  as  accurate  indicators  of  effec¬ 
tiveness  and  efficiency  of  a  proposed  detection  approach.  Our  data  set  consisted  of  the 
occurrence  amounts  of  network  behaviors  ll\  through  /I7,  discussed  in  Section  4,  for 
each  malware  and  benign  sample.  The  complete  data  set  was  used  for  clustering;  for 
classification,  the  training  set  contained  the  first  700  malware  samples  and  40  benign 
with  the  test  set  containing  the  remaining  samples.  The  31  undetected  malware  samples 
were  not  part  of  the  training  set.  Some  of  the  samples  in  the  test  set  not  found  in  the 
training  set  are  listed  in  Table  7. 


Malware  samples 

Benign  samples 

BHO.nby 

Adobe  Reader 

Mabezat.b 

BitTorrent 

Monderd.gen 

Chrome 

Poison.pg 

CuteFtp 

Swizzor.a  (2) 

Facebook  Desktop 

Turkojan.il 

FlickRoom 

VB.bfo 

Kaspersky  Security 

VB.vr 

Skype 

3 1  undetected  malware 

SopCast 

TVants 

Table  7.  Some  of  the  malware  and  benign  samples  in  test  set  and  not  in  training  set 


5.1  Clustering  results 

The  data  set  was  input  to  the  complete  suite  of  clustering  algorithms  in  Weka.  The  top 
three  results  are  listed  in  Table  8.  False  positives  and  false  negatives  were  determined 
by  observing  if  the  majority  of  a  cluster  was  composed  of  malware  or  benign  samples. 
If  malware  was  the  majority  then  the  benign  samples  were  classified  as  false  positives; 
if  benign  was  the  majority  then  the  malware  samples  were  classified  as  false  negatives. 


Clustering 

algorithm 

Number  of 
clusters 

True 

positives 

True 

negatives 

False 

positives 

False 

negatives 

FP 

rate 

FN 

rate 

DBScan 

8 

119 

1000 

4 

0 

0.4% 

0% 

Expectation 
maximization  (EM) 

4 

123 

988 

0 

12 

0% 

1% 

Xmeans 

3 

123 

1000 

0 

0 

0% 

0% 

Table  8.  Top  three  clustering  results  with  1000  malware  and  123  benign  samples 


DBScan  and  EM  algorithms  produced  encouraging  results  with  no  false  negatives 
in  the  first  and  no  false  positives  in  the  second  algorithm.  The  four  false  positives  pro¬ 
duced  by  DBScan  were  SopCast,  TVUPlayer,  UUsee  media  center,  and  TVants.  All  of 
these  are  video  streamers  whose  content  source  comes  from  several  IP  addresses  which 
are  constantly  changed  and  removed,  making  it  difficult  to  keep  up  to  date.  This  is 
very  similar  to  IP  addresses  used  by  malware  authors,  especially  in  botnets  [18],  which 
constantly  change  primarily  to  avoid  detection.  All  four  were  grouped  in  one  cluster 
with  many  different  classes  of  malware,  the  samples  in  this  cluster  exhibited  many  in¬ 
stances  of  behaviors  B\,  II2  and  Bj.  The  main  reason  why  the  four  false  positives  were 
grouped  in  this  cluster  was  due  to  having  between  3  and  8  instances  of  behavior  B-2- 
In  each  case,  we  attempted  to  access  several  video  streams.  Many  of  these  were  un¬ 
reachable  and  analyzing  the  network  activity  showed  the  failed  connection  attempts  to 
IP  addresses  of  successful  DNS  queries.  Further  investigation  into  these  IP  addresses 
revealed  they  were  temporary  video  content  servers  where  the  specific  video  streams 
were  no  longer  available.  The  IP  was  taken  offline  but  the  records  pointing  to  them  had 
not  been  removed  from  the  software’s  database  of  active  streamers. 


The  twelve  false  negatives  produced  by  the  EM  algorithm  consisted  of  nine  mal¬ 
ware  downloaders,  three  of  which  belong  to  the  packed. Win32. krap  family,  one  worm, 
one  hot  (koobface)  and  one  of  the  31  undetected  malware  samples  with  MD5  hash 
value  7407c24fl7d7c582901623c410ab7a91.  Three  samples:  koobface  and  two  mal¬ 
ware  downloaders  were  seemingly  inactive  having  no  successful  connection  attempts 
with  remote  hosts  and  only  four  samples  exhibited  at  most  a  single  instance  of  just  one 
of  the  following  behavior  symptoms:  Bi,  B2,  Bq,  B7.  The  small  amount  of  network 
behaviors  produced  by  these  malware  led  to  their  false  negative  production  since  their 
network  traffic  was  very  similar  to  the  benign  samples. 


The  Xmeans  algorithm  produced  no  false  positives  and  no  false  negatives,  with  all 
malware  grouped  in  two  clusters  and  benign  in  one  cluster.  The  31  undetected  malware 
samples,  see  Table  2,  were  correctly  clustered  by  both  Xmeans  and  DbScan  while  EM 
correctly  clustered  30  implying  our  network  behaviors  can  detect  malware  missed  by 
commercial  antivirus  software  and  may  be  usable  in  stopping  zero-day  attacks.  Overall, 
the  clustering  suggest  our  network  behaviors  are  capable  of  detecting  malware  with 
minimal  false  positives  and  false  negatives. 


5.2  Classification  results 


Several  classification  algorithms  were  applied  on  the  test  set  with  BayesNet,  NNge, 
Random  Forest  and  Rotation  Forest  producing  the  best  results  listed  in  Table  9.  The 
false  negative  rates  for  all  four  algorithms  were  low  ranging  from  0.6%  to  1%,  the 
false  positives  were  also  very  low  ranging  between  0%  to  2%.  All  the  algorithms  had 
the  same  two  malware  samples,  VB.vr  and  one  of  the  31  undetected  malware  (MD5 
hash  value  25181c8e<i97357652ea7756c5cica353c)  as  false  negatives.  Both  of  these 
malware  were  not  part  of  the  training  set,  exhibited  3  or  less  instances  of  behavior 
f?5  with  different  IP  addresses  and  had  successful  network  activity  with  remote  hosts 
whose  IP  addresses  were  acquired  through  successful  DNS  queries.  The  third  false  neg¬ 
ative  produced  by  BayesNet  was  one  of  the  31  undetected  malware  (MD5  hash  value 
c&ed573del8&900cd91cc9e4558/&645)  which  was  active,  had  two  instances  of  behav¬ 
ior  5.5  on  two  different  IP  addresses  and  was  not  in  the  training  set. 

TVants  and  SopCast  were  the  only  two  processes  flagged  as  false  positives.  These 
two  samples  were  also  clustered  as  false  positives.  The  reason  was  again  their  failed 
connection  attempts  to  IP  addresses  which  were  no  longer  online  hosting  a  video  stream 
thus  producing  instances  of  behavior  B2.  Only  one  of  the  31  undetected  malware  was 
flagged  as  false  negative  by  all  four  of  our  algorithms,  with  one  more  being  flagged 
by  BayesNet.  The  other  29  undetected  malware  were  all  correctly  classified  by  all  four 
algorithms.  This  result  further  confirms  the  capability  of  our  behaviors  and  occurrence 
amounts  to  detect  malware  not  detected  by  commercial  antivirus  software  and  gives 
further  evidence  to  their  use  in  helping  stop  zero-day  attacks.  Overall,  the  classification 
results  further  suggest  our  network  behaviors  can  correctly  classify  both  known  and 
unknown  malware. 


Classification 

algorithm 

False 

positives 

False 

negatives 

FP 

rate 

FN 

rate 

BayesNet 

1 

3 

1% 

1% 

NNge 

1 

2 

1% 

0.6% 

Random  forest 

0 

2 

0% 

0.6% 

Rotation  forest 

2 

2 

2% 

0.6% 

Table  9.  Top  four  classification  test  set  results  with  300  malware  and  83  benign  samples 


6  Discussion 

According  to  our  results  in  Section  4,  of  the  seven  defined  behaviors,  Bi  occurred  the 
most  in  the  malware  samples  with  49%  followed  by  B-2  with  21%  and  U-  with  18%. 
All  three  are  considered  behaviors  more  likely  to  occur  in  malware  than  in  benign  pro¬ 
cesses  with  B-j  initially  assumed  anomalous  and  not  necessarily  malicious.  Behaviors 
Bi,  B\ 5  and  Bq  are,  to  our  knowledge,  novel  observations  implemented  by  malware  to 


locate  active  remote  hosts  for  connection  attempts  and,  in  our  tests,  occurred  more  in 
malware  than  benign.  Behavior  li7  is  particularly  interesting  due  to  its  subjective  nature 
which  can  encapsulate  any  network  activity  considered  significant  and  rarely  occurring. 
Therefore  it  is  easy  to  add  activities  which  degrade  detection  accuracy.  A  knowledge 
expert  is  best  suited  to  compose  activities  which  comprise  this  behavior. 

Our  clustering  results  were  better  than  expected  with  perfect  results  in  the  case  of 
Xmeans,  implying  our  network  behaviors  are  capable  of  providing  accurate  malware 
detection.  Our  data  set  covered  a  wide  spectrum  of  known  malware  and  benign  classes 
and  was  able  to  train  our  classifiers  to  correctly  identify  the  majority  of  malware  in  the 
test  set  with  minimal  false  positives  and  false  negatives. 

The  most  interesting  aspect  of  the  results  was  the  highly  accurate  clustering  and 
classification  of  the  31  undetected  malware.  The  MD5  sums  of  all  31  samples  were  not 
detected  by  any  antivirus  software  on  Virustotal.com  on  01  April  2010  yet  our  testing 
correctly  identified  them  with  minimal  exceptions.  This  detection  accuracy  gives  strong 
evidence  that  our  behaviors  can  help  stop  zero-day  attacks  on  a  host  machine,  especially 
in  cases  where  signature-based  detectors  fail  to  identify  a  zero-day  attack. 

A  robust  detection  system  encompasses  several  malware  detection  perspectives. 
This  research  has  only  studied  one  of  these  perspectives,  network  activity,  in  a  behav¬ 
ior  based  way  to  avoid  implementing  a  detection  methodology  dependent  on  malware 
signatures.  Part  of  our  ongoing  research  is  to  combine  our  findings  of  the  network  ac¬ 
tivity  perspective  with  other  perspectives  to  produce  a  more  complete  behavior  based 
malware  detection  system. 

7  Limitations 

Several  protocols  such  as  ARP  and  SMB  were  not  studied.  Their  value  to  enhance  our 
detection  accuracy  is  being  analyzed  and  added  to  current  results.  All  analysis  was  done 
in  a  virtual  machine  which  forcibly  excluded  interesting  malware  samples  that  are  VM 
aware  and  ceased  to  execute  or  masqueraded  as  benign  upon  VM  detection.  The  data 
set  consisted  only  of  malware  samples  which  are  initially  executed  by  a  mouse  double 
click.  Malware  packaged  as  a  dll  file,  kernel  system  service,  or  non-executable  were  not 
used.  We  are  developing  tools  allowing  the  execution  of  any  malware  sample  regardless 
of  its  format. 

8  Conclusion  and  Future  Work 

This  research  analyzes  known  malware  and  benign  samples  in  an  attempt  to  exploit 
differences  in  their  network  activity  behavior  to  accomplish  accurate  behavior  based 
malware  detection.  By  analyzing  and  comparing  known  malware  and  benign  processes, 
we  have  successfully  exploited  differences  in  their  network  activity  behavior  and  pro¬ 
duced  accurate  and  effective  malware  detection  with  minimal  false  positives  and  false 
negatives.  This  was  accomplished  by  producing  a  set  of  behaviors  which  occurred  most 
often  in  our  analyzed  malware  samples  during  which  two  novel  behaviors  frequently 
used  by  malware  were  discovered. 


Our  analysis  results  successfully  clustered  a  diverse  group  of  malware  and  benign 
process  with  very  high  accuracy  and  minimal  false  positives  and  false  negatives.  Clas¬ 
sification  algorithms  correctly  detected  newly  introduced  malware  samples  also  with 
minimal  false  negatives  and  false  positives.  Most  interestingly,  our  data  set  included 
3 1  malware  samples  whose  MD5  sums  were  not  detected  by  any  antivirus  software  on 
Virustotal.com  on  01  April  2010.  These  undetected  malware  were  correctly  identified 
using  our  analysis  in  both  clustering  and  classification  algorithms  with  few  exceptions. 
This  provides  strong  evidence  that  our  identified  behaviors  can  be  used  together  with 
existing  anti-malware  solutions,  especially  signature-based  antivirus  software,  to  help 
stop  zero-day  attacks  on  a  host  machine.  This  research  has  presented  early  results  on 
one  perspective,  namely  network  activity,  of  a  larger  ongoing  project  to  develop  a  be¬ 
havior  based  malware  detection  system. 

Future  work  includes  examining  a  suite  of  protocols  for  yet-to-be  observed  activity 
usable  in  creating  new  behaviors  and  refining  our  current  behavior  set  and  evaluation 
methodology  to  further  increase  detection  effectiveness.  Alos  implementing  our  net¬ 
work  behaviors  in  a  real  time  detection  prototype  to  measure  the  efficiency  of  such  an 
approach  including  resource  usage  in  collecting  data  in  heavy  traffic  flows  and  precise 
measurements  of  elapsed  time  used  to  detect  a  malicious  process. 
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Abstract 

Dynamic  analysis  of  malware  is  an  ever  evolving  and 
challenging  task.  A  malware  infection  tree  (MiT)  can 
assist  in  analysis  by  identifying  processes  and  files  re¬ 
lated  to  a  specific  malware  sample.  In  this  paper  we 
propose  an  abstract  approach  to  building  a  comprehen¬ 
sive  MiT  based  on  rules  describing  execution  events  es¬ 
sential  to  malware  infection  strategies  of  files  and  pro¬ 
cesses.  The  MiT  is  built  using  strong  and  weak  bonds 
between  processes  and  files  which  are  based  on  transi¬ 
tivity  of  information  and  creator /created  relationships. 
The  abstract  approach  facilitates  usage  on  any  operat¬ 
ing  system  platform.  We  implement  the  rules  on  the 
Windows  Vista  operating  system  using  a  custom  built 
tool  named  MiTCoN  which  was  used  in  a  small  scale 
analysis  and  infection  tree  creation  of  a  diverse  set  of 
5800  known  malware  samples.  Results  analysis  revealed 
a  significant  occurrent  of  our  rules  within  a  very  short 
span  of  time.  We  demonstrate  our  rule  set  can  effec¬ 
tively  and  efficiently  build  infection  trees  linking  all  re¬ 
lated  processes  and  files  of  a  specific  malware  sample 
with  no  false  positives.  We  also  tested  the  possible  us¬ 
ability  of  a  MiT  in  disinfecting  a  system  which  yielded 
a  100%  success  rate. 


1  Introduction 

The  release  of  never  before  seen  malware  into  the 
wild  poses  a  severe  global  threat  to  vulnerable  systems 
given  the  difficulty  to  detect  via  signature  based  anti¬ 
malware  programs.  Possible  detection  can  be  achieved 
with  heuristics  but  not  guaranteed  to  fully  eradicate 
the  malware  which  leaves  disinfection  as  the  next  best 
option.  To  comprehensively  disinfect  a  system,  the 
malware  must  be  analyzed.  The  analysis  must  effec¬ 
tively  and  efficiently  detail  the  infection  process  in  a 
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meaningful  way  primarily  documenting  the  files  and 
processes  which  are  created  or  modified  by  the  mal¬ 
ware.  Analyzing  the  infection  process  can  be  aided  by 
building  a  malware  infection  tree  (MiT).  A  MiT  is  a 
directed  tree  structure  where  each  node  represents  a 
file  or  process  and  each  edge  represents  the  execution 
event  rule  causing  node  to  join  the  tree.  To  correctly 
build  a  MiT,  an  understanding  of  the  essential  char¬ 
acteristics  of  malware  infection  is  required.  From  the 
seminal  definitions  provided  by  Cohen  [2]  and  Adleman 
[1] ,  an  executable  file  labeled  a  virus  has  the  fundamen¬ 
tal  ability  to  self-replicate  which  we  consider  to  be  a 
basic  construct  for  a  MiT  and  it  is  essential  to  under¬ 
stand  the  different  ways  in  which  a  virus  can  implement 
this  construct  on  various  operating  systems.  Previous 
work  such  as  [10,  9]  has  shown  some  implementations  of 
self-replication  while  others  have  attempted  to  record 
malware  behavior  using  various  graph  structures  [7,  8]. 
A  more  comprehensive  MiT  includes  processes  modi¬ 
fied  by  malware.  There  are  many  known  ways  [12,  3] 
in  which  a  process  modifies  other  already  running  pro¬ 
cesses.  This  technique  is  primarily  implemented  via  a 
memory  injection  resulting  in  the  modified  process  per¬ 
forming  anomalous,  and  often  nefarious,  events.  Self¬ 
replication  and  memory  injection  create  a  strong  bond 
between  related  processes  and  files  and  are  the  basic 
constructs  of  our  MiT.  Our  MiT  is  further  enhanced 
with  constructs  recoding  the  creation  of  files  and  pro¬ 
cesses  by  a  malware  not  involving  self-replication  or 
memory  injetion  which  create  weak  bonds.  Implement¬ 
ing  the  various  ways  in  which  a  malware  infection  can 
occur  is  highly  OS  dependent.  It  is  imperative  to  col¬ 
lect  needed  data  of  a  malware  infection  in  as  low  a 
level  as  possible  to  assure  building  of  a  comprehensive 
MiT.  Some  malware,  such  as  a  rootkit  [5,  13]  will  exe¬ 
cute  at  deep  or  privileged  OS  levels  hiding  and  avoid¬ 
ing  detection  while  infecting  the  system.  In  this  pa¬ 
per,  we  present  an  abstract  approach  to  building  MiTs 
using  execution  events  rules.  The  rules  describe  exe¬ 
cution  events  essential  to  malware  infection  strategies 
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on  files  and  processes.  MiTs  are  built  based  on  strong 
and  weak  bonds  between  relevant  files  and  processes. 
We  describe  rule  implementation  in  the  Windows  Vista 
operating  system  with  our  custom  tool,  named  MiT- 
CoN,  that  builds  MiTs  in  real  time.  The  tool  analyzed 
and  built  MiTs  of  over  5800  diverse  known  malware 
samples.  We  evaluate  the  efficiency  of  our  approach 
by  recording  system  stability  during  MiT  creation  and 
timewise  analysis  of  MiT  creation.  Effectiveness  was 
evaluated  by  measuring  frequency  of  rules,  timewise 
occurrence  of  rules,  a  comparison  to  infection  struc¬ 
tures  of  comparable  systems  for  false  positive  produc¬ 
tion,  and  attempt  system  disinfection  using  only  the 
MiT  as  a  guide.  Our  analysis  revealed  our  MiTs  were 
constructed  within  7  seconds  of  initial  execution  with¬ 
out  any  noticeable  system  instability  .  Our  disinfection 
attempts  yielded  a  100%  success  rate  implying  MiTs 
may  be  useable  in  real  system  disinfection  scenarios. 
The  contributions  of  our  paper  are  as  follows: 

•  Propose  an  abstract  approach  to  building  malware 
infection  trees  (MiTs). 

•  Define  execution  event  rules  describing  essential 
components  of  infection  strategies. 

•  Describe  implementation  in  the  Windows  Vista 
OS  User  and  Kernel  levels. 

Several  comparable  systems  such  as  Anubis,  BitBlaze, 
JoeBox,  CwSandBox,  and  Malheur  perform  dynamic 
analysis  of  a  submitted  sample  and  return  a  tree  like 
structure  of  related  hies  and  processes.  The  tree  edges 
are  based  on  operating  system  specific  execution  events 
which  objectively  link  two  nodes  together.  The  crit¬ 
ical  problem  is  tree  creation  is  based  on  interpreta¬ 
tion  of  collected  execution  data  which  can  result  in  a 
tree  that  loosely  relates  hies  and  processes  which  may 
not  even  be  part  of  the  malware  infection  such  as  pro¬ 
cesses  and  hies  routinely  used  in  standard  OS  opera¬ 
tions.  This  misinformation  does  not  correctly  represent 
a  malware  infection  and  can  result  in  false  positives. 
Our  approach  creates  more  meaningful  trees  by  creat¬ 
ing  execution  event  rules  based  on  fundamental  mal¬ 
ware  infection  characteristics  producing  fundamental 
bonds  between  nodes  possibly  reducing  false  positives. 
The  result  of  our  approach  is  a  tightly  bound  tree  struc¬ 
ture  containing  minimal  or  no  non-related  data  leading 
to  a  minimization  of  false  positives  and  a  much  more 
realistic  representation  of  the  analyzed  malware’s  in¬ 
fection  strategies.  The  rest  of  this  paper  is  organized 
as  follows:  Section  2  presents  the  rules  used  to  create 
a  MiT,  Section  3  is  our  tool  implementation,  Section  4 
is  our  analysis  &  results,  and  Section  5  is  conclusions 
&  future  work. 
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Figure  1.  MiT  strong  &  weak  bonds 

2  MiT  Construction  Rules 

Strong  and  Weak  Bonds.  An  essential  compo¬ 
nent  of  our  MiT  building  approach  is  linking  nodes  to¬ 
gether  in  a  strong  or  weak  bond,  illustrated  in  Figure 
1,  relationship  based  on  the  malware  infection  related 
interaction  between  two  nodes  during  execution.  We 
dehne  a  strong  bond  between  a  source  node  Q  and 
destination  node  P  when  a  transfer  of  data  from  Q  to 
P  occurs.  The  data  is  part  of  Q  and  transferring  it 
to  P  strongly  bonds  both  nodes  based  on  the  transi¬ 
tivity  of  data  creating  an  intersection  of  identical  data 
between  Q  and  P.  We  dehne  a  weak  bond  when  a 
source  node  Q  arbitrarily  creates  a  destination  node  P 
and  there  is  no  transitivity  of  data.  This  weak  bond 
can  be  viewed  as  a  creator/created  relationship  since 
P  exists  because  of  Q  but  there  is  no  intersection  of 
identical  data  between  both.  Other  comparable  sys¬ 
tems  create  infection  trees  based  only  observed  execu¬ 
tion  events  and  their  nodes  are  not  linked  in  a  fun¬ 
damentally  meaningful  way  allowing  addition  of  non- 
related  nodes  to  the  tree.  The  bonds  between  nodes  of 
our  MiTs  are  dehned  in  the  fundamental  realm  of  mal¬ 
ware  infection  strategies  producing  an  infection  tree 
with  the  reduced  likelihood  of  including  non-malware 
related  nodes.  A  strongly  bonded  MiT  will  consist  of 
essential  hies  and  processes  directly  descending  from 
the  original  malware  executable  due  to  the  transitivity 
of  data  between  nodes  and  should  be  eradicated  hrst  in 
a  disinfection  strategy  to  prevent  further  infection  and 
injury.  Enhancing  the  MiT  with  weak  bonds  provides 
those  hies  and  processes  that  may  not  be  essential  to 
the  malware’s  infection  and  injury  but  still  should  be 
eradicated  during  disinfection. 

Construction  Rules.  We  dehne  a  malware  infec¬ 
tion  tree  (MiT)  as  the  output  of  a  function  M(-)  on 
inputs  X  with  a  set  of  execution  event  rules  f?,  where 
X  is  the  executable  hie  being  analyzed,  and  R  contains 
a  set  of  rules  ri...rn  that  dehne  the  conditions  under 
which  an  object  becomes  part  of  a  MiT.  M(X)  outputs 
a  MiT  with  a  directed  graph  as  MiT  =  ( N ,  E),  where 
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TV  is  a  set  of  nodes  ni...nz  with  z  >2,  and  E  is  a  set 
of  ordered  pairs  of  nodes  (■ rii,nj )  £  TV 2  that  create  an 
edge  between  n,  and  rij .  A  node  n  is  either  a  file  in  the 
file  system  or  a  currently  running  process.  We  assume 
X  is  always  the  root  node  of  a  MiT.  Once  a  new  node 
n  is  added  to  a  MiT,  its  infection  execution  events  are 
recorded  and  used  to  add  further  nodes  to  the  MiT.  If 
z  =  1,  then  only  one  node  n,  presumably  the  root  node 
X,  is  present  in  the  MiT  and  thus  no  tree  was  created. 
MiT  construction  is  based  on  the  rules  in  R.  The  rules 
are  mostly  based  on  the  fundamental  definition  of  mal¬ 
ware,  Cohen  [2]  &  Adleman  [1] ,  which  stipulate  a  mal¬ 
ware,  specifically  viruses  and  worms,  must  replicate  to 
be  classified  as  such.  The  rules  also  reflect  a  malware’s 
tendency  to  nefariously  modify  running  processes. 

File  System  Rules.  We  consider  a  malware  can 
infect  via  self  replication  into  the  file  system  in  two 
primary  ways.  First,  using  a  call  such  as  copy(n,m ), 
where  n  is  a  node  of  a  MiT  and  the  caller  of  the  copy 
function.  In  this  case  n  creates  a  new  file  which  is  an 
exact  copy  of  itself.  Second,  using  a  series  of  calls  such 
as  read(n,  q);write(q,m),  where  n  is  a  node  of  a  MiT 
and  the  caller  of  read  and  write,  q  is  some  temporary 
storage,  and  m  is  an  already  existing  file  that  is  being 
modified  by  n.  The  modification  can  be  achieved  by 
n  prefixing,  suffixing,  overwriting  or  randomly  writing 
some  or  all  of  its  own  data  into  the  file  to.  Note  in  both 
cases,  n  is  the  source  of  the  replications  and  the  caller 
of  the  operations,  therefore  n  is  invoking  self-reference 
replication  as  described  in  [9] .  We  also  consider  a  mal¬ 
ware  that  can  infect  a  system  without  requiring  self 
replication.  This  can  be  donw  through  the  arbitrary 
creation  of  files  such  as  createfile(m)  in  the  file  sys¬ 
tem,  where  an  existing  node  n  calls  the  operation  and 
creates  a  file  that  contains  no  data  originating  in  n. 
There  are  many  known  malware  samples  that  create 
files  during  execution  for  several  reasons  such  as  logs, 
configurations  or  to  store  data  later  sent  to  a  remote 
host.  Even  though  these  files  are  not  created  via  self 
replication  they  are  valid  components  of  an  infection 
and  should  be  included  in  a  MiT.  Based  on  these  con¬ 
siderations,  we  define  the  following  three  file  system 
execution  event  rules  for  the  construction  of  a  MiT: 

/l:Infection  via  self  replication.  A  file  m  be¬ 
comes  a  new  node  n  £  TV  of  a  MiT  if  and  only  if 
P(replicate(P,m))  —>  True  where  a  currently  running 
process  P  is  a  pre-existent  node  in  a  MiT  and  P  has 
issued  a  replication  request  where  P  itself  is  the  source 
parameter  and  m  is  the  destination  parameter,  in  be¬ 
comes  a  new  node  in  the  MiT  where  the  node  for  P 
is  located  and  the  outgoing  edge  (P,  to)  is  labeled  / 1. 
Note  that  in  can  be  a  newly  created  file  by  P  or  an 
existing  file  and  the  replication  from  P  to  m  can  be 


complete  or  partial.  / 1  exhibits  transitivity  of  data 
and  is  labeled  a  strong  bond. 

/2:Infection  via  arbitrary  file  creation.  A  file 
in  becomes  a  new  node  n  £  TV  of  a  MiT  if  and  only  if 
P(filecreate(rn))  — >  True  where  a  currently  running 
process  P  is  a  pre-existent  node  in  a  MiT  and  P  has 
issued  a  file  creation  request  where  in,  a  newly  created 
file,  is  the  destination  parameter,  m  becomes  a  new 
node  in  the  MiT  where  the  node  for  P  is  located  and 
the  outgoing  edge  (P,  to)  is  labeled  / 2.  / 2  is  a  cre¬ 
ator/created  relationship  with  no  transitivity  of  data 
and  is  labeled  a  weak  bond. 

/3:Infection  via  arbitrary  file  write  modifica¬ 
tion.  A  file  m  becomes  a  new  node  n  £  TV  of  a  MiT 
if  and  only  if  P(filewrite(m))  — >  True  where  a  cur¬ 
rently  running  process  P  is  a  pre-existent  node  in  a 
MiT  and  P  has  issued  a  file  write  request  where  to,  a 
pre-existing  file,  is  the  destination  parameter,  to  be¬ 
comes  a  new  node  in  the  MiT  where  the  node  for  P 
is  located  and  the  outgoing  edge  (P,  to)  is  labeled  / 3. 
Note  P  may  modify  to  either  by  prefixing,  suffixing, 
overwriting  or  randomly  writing  data  into  in.  The  es¬ 
sential  component  of  / 3  is  the  data  being  written  to  in 
comes  from  some  other  location  and  not  from  P.  / 3 
does  not  exhibit  transitivity  of  data  and  is  labeled  a 
weak  bond. 

Process  Rules.  We  consider  malware  that  can  in¬ 
fect  a  system  via  process  manipulation  in  two  primary 
ways.  First,  a  malware  can  modify  the  memory  range 
of  a  currently  running  process.  This  is  typically  done 
via  dynamic  code  injection  [12,  3].  A  malware  will 
write  (inject)  nefarious  code  into  the  allocated  mem¬ 
ory  of  some  other  process  and  then  spawn  a  new  pro¬ 
cess  instance  which  executes  the  just  injected  code. 
This  results  in  the  victim  process  performing  execution 
events  it  would  otherwise  not  do  under  benign  con¬ 
ditions.  Several  known  malware  inject  nefarious  code 
into  system  critical  processes.  Malware  authors  assume 
there  is  a  high  unlikelihood  that  a  user  would  termi¬ 
nate  these  processes  since  they  are  considered  critical 
to  OS  functionality.  Second,  malware  will  create  pro¬ 
cesses  from  the  static  file  images  of  executables  that 
either  were  created  or  downloaded  to  the  system.  Mal¬ 
ware  is  known  to  copy  or  download  from  remote  ma¬ 
licious  servers  to  the  system  other  malware  as  part  of 
its  payload,  in  many  cases  a  trojan(s)  such  as  password 
stealer,  key  loggers,  and  spam  engines.  In  some  cases, 
a  malware  may  self  replicate  by  spawning  multiple  pro¬ 
cesses  of  itself  running  on  a  system  to  either  overwhelm 
the  system  or  survive  an  anti-malware  detection  and  re¬ 
moval  attempt.  Process  manipulation  is  a  very  power¬ 
ful  tool  for  malware  facilitating:  system  compromise, 
increased  chances  of  detection  survival,  and  delegate 
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nefarious  goals  to  benign  processes.  Injection  is  par¬ 
ticularly  powerful  allowing  nefarious  deeds  to  possibly 
go  unnoticed  when  carried  out  by  benign  system  criti¬ 
cal  processes.  Based  on  these  considerations,  we  define 
the  following  two  process  execution  event  rules  for  the 
construction  of  a  MiT: 

pl:Infection  via  dynamic  code  injection  of 
a  currently  running  process.  The  static  file 
image  m  of  a  currently  running  process  P  be¬ 
comes  a  new  node  n  £  N  of  a  MiT  if  and  only 
if  Q(codeinject(P,d))',Q(newprocessinstance(P))  — > 
True  where  Q  is  some  currently  running  process  and 
a  pre-existent  node  of  a  MiT,  writes  data,  presumably 
code  instructions  stored  in  Q’s  memory  space  or  static 
file  image,  into  the  allocated  memory  space  of  P  and 
then  spawns  a  new  instance  of  P  which  executes  the 
just  written  data,  m  becomes  a  new  node  in  the  MiT 
where  the  node  for  Q  is  located  and  the  outgoing  edge 
(Q,  to)  is  labeled  pi.  P  is  a  pre-existing  and  presumed 
benign  currently  running  process  of  the  system  which 
gets  nefariously  modified  by  Q  in  memory.  The  static 
file  image  m  of  the  modified  process  P  is  never  modified 
by  Q.  pi  exhibits  transitivity  of  data  and  is  labeled  a 
strong  bond. 

p2:Infection  via  process  spawning.  The  static 
file  image  to  of  a  currently  running  process  P  be¬ 
comes  a  new  node  n  G  N  of  a  MiT  if  and  only  if 
Q(newprocess{m ))  —>  True  where  Q,  a  currently  run¬ 
ning  process  and  a  pre-existent  member  of  a  MiT,  in¬ 
vokes  the  command  to  create  a  new  process  P  from  a 
static  file  image  to.  The  static  file  image  to  is  also  a 
pre-existent  node  of  the  same  MiT  as  Q.  An  outgoing 
edge  ( Q,m )  is  created  and  labeled  p2  which  stores  in¬ 
formation  about  the  static  file  image  in  and  the  newly 
created  process  P.  In  this  scenario,  Q  spawns  new  pro¬ 
cesses  from  files  that  are  already  part  of  the  malware’s 
MiT  including  Q’s  own  static  file  image.  Some  of  these 
files  were  either  created  or  downloaded  by  some  node 
of  the  MiT.  The  key  element  in  this  rule  is  that  m  is  a 
malware  related  file  and  part  of  the  MiT.  As  opposed 
to  pi  where  malware  injects  nefarious  code  into  be¬ 
nign  processes  then  creates  a  new  instance,  p2  creates 
a  process  from  a  nefarious  static  file  image  to.  The  ne¬ 
fariousness  of  to  is  based  on  its  pre-existing  inclusion 
as  a  node  of  a  MiT.  It  is  possible  for  to  to  be  created 
by  a  process  R  using  the  file  system  rules  above  and 
then  spawned  as  a  new  process  by  Q.  Both  R  and  Q 
are  separate  nodes  of  the  same  MiT,  this  would  pro¬ 
duce  two  incoming  edges  to  m :  one  for  the  file  creation 
and  one  for  the  process  spawning.  Note  that  to  can 
be  the  static  file  image  of  Q1  meaning  p2  also  records 
when  Q  creates  a  new  process  instance  of  itself.  In  this 
case  the  outgoing  edge  ( Q ,  m)  contains  the  same  static 


file  image  information  for  to  and  different  identifica¬ 
tion  information  for  the  newly  spawned  process  P.  ps 
is  a  creator/created  relationship  with  no  transitivity  of 
data  and  is  labeled  a  weak  bond. 

Implementing  the  file  system  and  process  rules  can 
produce  an  intersection  of  usage  based  on  the  target 
OS.  For  example,  / 2  and  / 3  may  be  invoked  as  sub¬ 
routines  of  / 1,  / 3  may  be  invoked  as  a  subroutine  of 
/ 2,  and  / 2  can  be  invoked  as  a  subroutine  of  both  / 1 
and  / 3.  Another  example  is  the  OS  performing  code 
injection  to  create  both  new  processes  and  instances  of 
already  running  processes.  In  this  case,  extra  informa¬ 
tion  about  the  process  and  its  injector  must  be  acquired 
to  adequately  decide  if  pi  or  p2  has  occurred.  To  allow 
some  reasonable  flexibility  of  the  rules  to  accommo¬ 
date  unavoidable  intersections  we  assume  the  execu¬ 
tion  event  rules  describe  abstract  scenarios  that  may 
invoke  other  rules  as  subroutines  and  explicitly  recog¬ 
nizing  the  use  of  these  other  rules  is  not  required.  A 
MiT  can  have  multiple  edges  resulting  from  file  system 
and  process  rules  occurring  on  the  same  node.  When 
building  a  MiT,  it  is  allowable  to  have  multiple  edges 
between  nodes  reflective  the  rule  justifying  the  edge’s 
existence.  A  possible  scenario  may  be  a  process  P  cre¬ 
ates  a  file  to  which  then  creates  a  hie  o.  The  hie  o  is 
then  spawned  as  a  new  process  by  P.  This  would  create 
two  incoming  edges  to  o,  the  hrst  edge  is  (to,  o)  labeled 
/ 1  and  the  second  edge  is  (P,o)  labeled  p2.  Having 
a  nrulti-edged  MiT  provides  richer  data  for  disinfec¬ 
tion.  Based  on  the  MiT  a  user  can  assess  which  nodes 
should  be  dealt  with  hrst,  perhaps  using  the  number 
of  incoming  and  outgoing  edges  as  a  weighted  deter¬ 
mination  scale.  Those  nodes  with  greater  number  of 
incoming/outgoing  edges  may  receive  higher  priority 
over  other  less  populated  nodes.  Once  a  MiT  is  created, 
some  of  the  hies  and  processes  belonging  to  the  MiT 
may  no  longer  exist.  Several  known  malware  samples 
delete  hies  and  terminate  processes  which  may  belong 
to  its  own  MiT.  A  typical  case  may  be  a  malware  X  cre¬ 
ates  several  descendant  hies  some  of  which  are  spawned 
as  new  processes.  Then  some  time  later  one  of  these 
descendant  processes  terminates  X  and/or  erases  its 
static  hie  image  from  the  hie  system.  Another  scenario 
is  X  or  any  other  spawned  processes  terminating  itself 
and/or  deleting  its  own  hie  from  the  system.  These  sce¬ 
narios  do  not  invalidate  the  MiT  as  the  hie  or  process 
did  in  fact  exist  at  the  moment  of  addition  as  a  node 
to  the  MiT.  The  key  challenge  in  implementing  these 
rules  is  understanding  the  various  ways  a  hie  or  process 
can  be  created  or  modihed  in  a  specihc  OS.  Once  this 
is  understood,  an  implementation  can  be  created  at  an 
appropriate  OS  level  that  captures  all  or  most  of  the 
studied  implementations.  We  address  this  challenge  in 
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the  next  section  which  presents  our  MiT  construction 
tool  for  the  Windows  Vista  OS  platform. 

3  MiTCoN:  Windows  Vista  Implemen¬ 
tation 

Our  malware  infection  tree  construction  tool,  named 
MiTCoN,  is  a  Windows  command  line  application 
which  implements  the  execution  event  rules  in  the  Win¬ 
dows  Vista  platform  which  outputs  a  table  representa¬ 
tion  of  a  MiT.  The  MiT  is  built  in  real  time  by  monitor¬ 
ing  the  samples’  execution  behaviors.  MiTCoN  takes 
as  input  the  absolute  path  of  the  target  WIN32  PE  ex¬ 
ecutable  set  as  the  MiT’s  root  node  which  facilitates 
MiT  building  by  knowing  the  process  from  which  to 
start  monitoring  execution  behaviors. 

Implementing  File  System  Rules.  To  detect 
when  / 1,  / 2,  and  / 3  occur  by  some  process  P ,  MiT¬ 
CoN  traces  a  set  of  file  system  functions  located  in 
the  Windows  kernel  using  a  form  of  function  hooking 
[5,  13].  These  functions  belong  to  the  Zw  family  [14] 
and  are  located  in  the  Windows  SSDT  table  [11].  Win¬ 
dows  provides  several  ways  to  create  and  modify  files 
at  the  user  level.  When  these  commands  get  passed 
down  to  the  kernel  level,  Windows  merges  them  into  a 
handful  of  Zw  functions.  MitCon  traces  two  sequences 
of  Zw  function  calls  that  successfully  implement  / 1, 
and  one  sequence  for  / 2  and  / 3.  The  first  sequence 
to  determine  the  occurrence  of  /l,  infection  via  self 
replication,  for  some  process  P,  MiTCoN  traces  the 
sequence  of  Zw  functions  with  appropriate  parameters 
listed  in  Table  1.  The  key  to  determining  that  / 1  has 
occurred,  is  to  establish  that  P  is  referencing  itself  and 
is  the  source  of  the  write  operation.  According  to  Ta¬ 
ble  1,  self- replication  starts  with  P  opening  itself  with 
read  access  in  ZwCreateFile  where  sourcepath  is  the 
absolute  path  of  P.  The  functions  ZwCreateSection 
and  ZwMapViewofSection  use  the  file  handle  returned 
from  ZwCreateFile  to  map  the  data  that  is  going  to 
be  written  from  P  into  memory.  Finally,  ZwWriteFile 
reads  the  data  stored  in  baseaddress,  which  is  returned 
by  ZwMapViewofSection,  into  the  target  hie.  At  this 
point,  an  instance  of  / 1  has  been  completed  by  P.  Note 
in  MapViewofSection,  the  in  parameter  processhandle 
is  assured  to  be  the  value  -1.  This  indicates  the  map¬ 
ping  will  be  of  the  caller  process  P.  This  sequence  of 
calls  is  used  when  P  makes  an  exact  copy  of  itself  in  a 
newly  created  hie. 

The  second  sequence  to  determine  the  occurrence  of 
/ 1  for  some  process  P,  MiTCoN  traces  the  sequence 
of  Zw  functions  with  appropriate  parameters  listed  in 
Table  2.  The  second  sequence  in  determining  / 1  is 


ZwCreateFilc(in:read_access,  in:sourcepath,  out:filehandle); 

ZwCreateSection(in:filehandle,  out:sectionhandle); 

ZwMapViewofSection(in:sectionhandle, 
in : processhandle ,  out:baseaddress); 

ZwWriteFile(in:baseaddress,  out:targetfilepath) 

Table  1. 1st  Function  Sequence  Used  in  / 1 

simpler  involving  only  two  Zw  functions.  The  source 
path  in  ZwReadFile  refers  to  the  absolute  path  of  P 
and  the  memaddress  is  used  as  temporary  storage  of 
the  data  from  P  which  is  written  to  targethlepath  in 
ZwCreateFile.  This  sequence  of  calls  is  used  primarily 
when  P  is  self  replicating  into  already  existing  hies. 

ZwReadFile(in:sourcepath,  out:memaddress); 

ZwWriteFile(in:memaddress,  out:targetfilepath); 

Table  2.  2nd  Function  Sequence  Used  in  / 1 

To  determine  the  occurrence  of  / 2,  infection  via 
arbitrary  hie  creation,  for  some  process  P,  MiTCoN 
traces  with  appropriate  parameters  the  ZwCreateFile 
function  as  listed  in  Table  1.  The  only  parameter  con¬ 
sidered  is  sourcepath  which  is  assured  not  to  be  the 
name  and  hie  system  location  of  the  caller  process  P. 
With  this  assurance,  P,  the  caller  process,  is  creating  a 
completely  new  hie  with  the  name  and  hie  system  loca¬ 
tion  stored  in  sourcepath.  To  determine  the  occurrence 
of  / 3,  infection  via  arbitrary  hie  write  modihcation,  for 
some  process  P,  MiTCoN  traces  with  appropriate  pa¬ 
rameters  the  sequence  of  functions  listed  in  Table  2, 
which  is  one  of  the  sequences  used  for  / 1.  The  differ¬ 
ence  is  in  establishing  an  occurrence  of  / 3,  the  sour¬ 
cepath  parameter  in  ZwReadFile  is  assured  not  to  be 
the  hie  system  location  of  P.  With  this  assurance  / 3  is 
determined  since  the  data  being  written  is  from  some 
other  part  of  the  system  and  not  from  P. 

Implementing  Process  Rules.  To  detect  when 
pi  and  p2  occurs  by  some  process  P,  MiTCoN  per¬ 
forms  function  hooking  on  kernel  level  Zw  functions 
and  user  level  API  functions.  Several  of  the  various 
ways  in  which  a  process  can  be  injected  and  spawned 
at  the  user  level  filter  down  to  a  handful  of  Zw  functions 
in  the  kernel.  In  Windows,  a  process  can  be  spawned  in 
two  primary  forms:  a  WIN32  process  and  a  Windows 
service.  MiTCoN  traces  one  sequence  of  function  calls 
for  pi,  and  two  sequences  for  p2.  To  determine  the  oc¬ 
currence  of  pi,  infection  via  dynamic  code  injection  of 
a  currently  running  process,  for  some  process  P,  MiT¬ 
CoN  traces  with  appropriate  parameters  the  sequence 
of  function  calls  listed  in  Table  3.  This  sequence  of 
functions  facilitates  the  injection  of  data  between  pro¬ 
cess  allocated  memories.  MiTCoN  assures  P  is  the 
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caller  of  all  three  functions.  Memory  is  first  allocated 
with  ZwAllocateVirtualMemory  in  the  process  identi¬ 
fied  by  processhandle  starting  at  baseaddress.  these 
two  parameters  are  again  used  to  write  (inject)  data  by 
P  in  the  allocated  memory  of  processhandle.  Finally 
P  creates  a  new  instance  of  the  just  injected  process 
with  CreateThread  which  causes  the  new  instance  to 
execute  the  newly  written  nefarious  code. 

ZwAllocateVirtualMemory (in:processhandle,  out:baseaddress); 

ZwWriteVirtualMemory (improcesshandle,  imbaseaddress) ; 

ZwCreateThread(in:processhandle,  out:threadhandle) ; 

Table  3.  Function  Sequence  Used  in  pi 

The  first  sequence  to  determine  the  occurrence  of 
p2 ,  infection  via  process  spawning,  for  some  process 
P,  MiTCoN  traces  with  appropriate  parameters  the 
function  call  listed  in  Table  4.  This  single  function 
suffices  to  create  a  new  process  from  the  static  file  im¬ 
age  detailed  in  objectattributes.  The  function  returns 
a  handle,  in  processhandle,  to  the  newly  created  pro¬ 
cess.  MiTCoN  checks  if  the  newly  created  process  is  of 
a  static  file  image  that  is  already  a  node  of  P’s  MiT.  If 
yes,  then  an  edge  is  added. 

|  ZwCreateProcess(in:objectattributes,  outiprocesshandle)  | 

Table  4. 1st  Function  Sequence  Used  in  p2 

The  second  sequence  to  determine  the  occurrence  of 
p2  for  some  process  P,  MiTCoN  traces  with  appropri¬ 
ate  parameters  the  sequence  of  function  calls  listed  in 
Table  5.  CreateService  will  use  the  file  located  in  Bi- 
naryPath  as  the  service  to  be  registered  and  returns 
in  serviceliandle  a  handle  identifying  the  service.  This 
handle  is  used  in  both  OpenService  and  StartService, 
the  end  of  which  results  in  the  service  running  on  the 
system.  Note  the  function  CreateService  has  an  addi¬ 
tional  parameter  that,  given  the  proper  value,  can  start 
the  service  immediately.  If  this  occurs,  MiTCoN  will 
not  need  to  detect  the  invocation  of  OpenService  and 
StartService. 

CreateService(in:BinaryPath,  out:servicehandle); 

OpenService(inout:  servicehandlc); 

StartServicefin:  servicehandle) ; 

Table  5.  2nd  Function  Sequence  Used  in  p2 

MiTCoN  assures  P  is  the  caller  process  for  every 
function  in  a  sequence  by  invoking  GetCurrentPro- 
cessQ  at  both  the  user  and  kernel  levels.  As  rules  are 
identified,  the  appropriate  MiT  is  updated  with  new 
nodes  and  edges.  The  completed  MiT  presents  all  the 


files  and  processes  identified  as  the  source  or  destina¬ 
tion  of  a  given  rule.  The  number  of  edges  going  in 
and  out  of  any  one  node  represents  the  number  of  rule 
instances. 

MiTCoN  Example.  We  present  the  MiT  out¬ 
put  of  MiTCoN  in  Table  6  and  its  tree  graph  in  Fig¬ 
ure  2  for  the  malware  Backdoor. Win32. Poison.  Ta¬ 
ble  6  has  three  columns:  Source  identifies  the  caller 
process  of  the  operation(s),  Rule(s)  gives  the  execu¬ 
tion  event  rules  invoked  by  Source,  and  Destination 
gives  the  target  of  the  execution  event.  Each  row  is 
a  complete  invocation  of  a  rule  forming  a  node  pair 
(source, destination)  and  Rule  is  the  label  for  the  out¬ 
going  edge.  The  first  row’s  Source  file  name  is  the 
root  of  the  MiT.  In  the  first  row,  Poison  performed 
rule  / 1  on  svchest.exe,  read  as:  Poison  self-replicated 
into  svchest.exe.  The  second  to  last  row  shows  svch¬ 
est.exe  performed  rules  / 1  and  p2  on  svchvst.exe,  read 
as  svchest.exe  replicated  into  and  spawned  new  pro¬ 
cess  svchvst.exe.  Poison  has  four  child  nodes:  svch¬ 
est.exe.  l.bat,  l.reg,  and  l.vbs.  Of  these  four,  Poison 
self-replicated  (/ 1)  into  svchest.exe  and  l.vbs  and  cre¬ 
ated  (/ 2)  l.bat  and  l.reg. 


Source 

Rule(s) 

Destination 

Poison 

/i 

svchest.exe 

Poison 

/  2 

l.bat 

Poison 

/  2 

l.reg 

Poison 

/I 

l.vbs 

Poison 

pi 

wscript.exe 

wscript.exe 

pi 

cmd.exe 

cmd.exe 

pi 

regedit.exe 

cmd.exe 

pi 

attrib.exe 

cmd.exe 

pi 

reg.exe 

cmd.exe 

pi 

reg.exe 

cmd.exe 

p2 

svchest.exe 

cmd.exe 

pi 

ping.exe 

svchest.exe 

/1,P2 

svchvst.exe 

svchvst.exe 

/I 

svchist.exe 

Table  6.  MiT  of  Backdoor.Win32. Poison 


4  Evaluation  &  Results 

Using  MiTCoN,  we  analyzed  and  built  MiTs  for 
5800  diverse  known  malware  samples  randomly  se¬ 
lected  from  GFI  Sandbox  malware  repository  [4]  up¬ 
loaded  between  April  and  June  2011.  According  to 
Kaspersky  anti-virus,  the  samples  were  classified  as: 
Trojans,  Worms,  Viruses,  Adware,  Spyware,  FakeAVs, 
Monitors,  Risk  Tools,  PSW  Tools,  Hoaxes,  web  Tool 
Bars,  Porn  Dialers,  Downloaders,  Remote  Admin 
Tools,  IRC  Clients,  P2P  worms,  Email  worms,  Back¬ 
doors,  Bots,  Bankers,  Clickers,  Ransom,  Packed,  Game 
Thiefs,  Exploits,  Rootkits,  and  Droppers.  Analysis  was 
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Figure  2.  MiT  graph  of  Back- 
door.Win32. Poison 


conducted  in  VMWare  Workstation  with  a  fresh  install 
of  Microsoft  Windows  Vista  logged  into  the  default  ad¬ 
ministrator  user  account.  MiTCoN  and  each  sample 
were  executed  for  three  minutes,  the  MiT  enhanced 
with  timestamps  was  saved  to  a  database  and  the  snap¬ 
shot  reverted  to  a  clean  state.  MiTCoN  proved  to 
be  highly  efficient  by  not  causing  system  instability  or 
crashes  during  analysis,  CPU  usage  per  sample  peaked 
at  3%  and  averaged  less  than  1%.  Total  RAM  memory 
usage  never  surpassed  14MB  with  MiTCoN  executing. 
MiTCoN  never  took  more  than  7  seconds  to  build  and 
record  a  MiT  and  averaged  these  operations  at  3.1  sec¬ 
onds.  There  was  a  high  frequency  of  rule  occurrence  in 
all  analyzed  malware  samples  with  overall  totals  as  fol¬ 
lows:  /1:662,  /2:14396,  /38629,  pl:647,  p2:3490  with 
the  top  three  malware  classes:  Trojan:  293,  Worm:63, 
Backdoor:56.  Every  analyzed  sampled  had  multiple  oc¬ 
currences  strong  bond  rules  implying  strongly  bonded 
MiTs  may  suffice  to  understand  the  malware’s  essen¬ 
tial  infection  strategy  with  weak  bonds  enhancing  the 
MiT.  In  the  instances  of  pi,  we  infered  the  malware 
was  delegating  nefarious  deeds  off  to  seemingly  benign 
processes  in  order  to  achieve  their  goals  while  not  be¬ 
ing  identified.  The  majority  of  processes  were  spawned 
as  WIN32  with  a  several  being  pre-existing  nodes  in 
the  MiT  itself  (p2).  Many  files  where  spawned  as  a 
Windows  service  and  not  a  regular  WIN32  process. 
Detecting  services  proved  critical  to  building  compre¬ 
hensive  trees  since  the  node’s  subtree  was  substantial. 
Many  nodes  had  multiple  incoming  and/or  outgoing 
edges  which  illustrates  a  strong  bond  between  file  and 
process  manipulation  by  malware.  The  most  common 


combination  was  /l,p2  where  malware  would  self  repli¬ 
cate  then  spawn  the  process  to  have  multiple  instances 
running  on  the  machine.  We  conjecture  this  avoids 
complete  malware  eradication  or  overwhelms  the  sys¬ 
tem  facilitating  compromise.  Analyzing  timestamps  of 
rule  occurrence,  we  discovered  all  instances  of  every 
rule  occurred  within  200  milliseconds  from  initial  mal¬ 
ware  execution  with  an  overall  average  12  milliseconds, 
the  strong  bond  rules  averaged  11  milliseconds  for  / 1 
and  14  milliseconds  for  pi.  We  randomly  selected  120 
malware  samples  and  executed  them  on  the  analysis 
platforms  Anubis  and  GFI  SandBox.  The  resulting 
infection  tree  structures  from  both  were  compared  to 
the  MiT’s  created  by  MiTCoN.  In  114  samples  both 
platforms  included  nodes  (either  a  process  or  file)  that 
was  excluded  in  our  MiT.  Further  analysis  revealed 
these  nodes  represented  files  and  processes  belonging 
to  standard  Windows  process  operations  and  were  not 
part  of  the  malware  infection  and  are  therefore  false 
positives.  A  typical  scenario  was  the  inclusion  of  ser¬ 
vices. exe,  which  is  used  in  Windows  each  time  a  pro¬ 
cess  requests  creation  or  start  of  a  Windows  service. 
Our  approach  to  MiT  building  based  on  fundamental 
malware  infection  seems  to  create  more  relevant  MiTs 
with  the  ability  to  exclude  files  and  processes  belonging 
to  standard  Windows  operations.  The  high  frequency, 
early  occurrence  and  lack  of  false  positives  makes  our 
rules  for  building  MiTs  highly  effective  in  analyzing 
malware. 

We  test  the  possible  usability  of  MiTCoN  in  sys¬ 
tem  disinfection  with  our  previously  randomly  chosen 
120  samples  and  use  the  resulting  MiT  to  attempt  dis¬ 
infection  of  the  system.  Testing  was  done  in  VMWare 
Workstation  running  a  snapshot  of  a  fresh  install  of  Mi¬ 
crosoft  Windows  Vista  logged  into  the  default  admin¬ 
istrator  user  account.  Kaspersky  anti-virus  [6]  scanned 
the  infected  system  to  assess  how  successful  our  disin¬ 
fection  attempt  was.  Before  testing,  we  invoked  a  com¬ 
plete  system  scan  by  Kaspersky  which  resulted  with 
no  infections  found.  This  was  done  to  assure  an  ini¬ 
tial  malware  free  testing  environment  and  any  discov¬ 
ered  infections  occurring  after  this  initial  scan  was  at¬ 
tributed  to  the  malware  executed  by  us  in  the  test  sys¬ 
tem.  Our  evaluation  was  performed  in  two  rounds. 
Round  one  of  testing  was  as  follows:  1. Initially,  the 
clean  state  snapshot  is  loaded  2. A  malware  sample 
is  copied  to  the  Windows  desktop  3. MiTCoN  is  exe¬ 
cuted  using  the  sample’s  path  as  input  4. The  sample 
is  executed  for  3  minutes  5. The  resulting  MiT  is  saved 
for  later  use  6. The  infected  snapshot  is  scanned  with 
Kaspersky  &  the  results  saved  7.  Return  to  step  1  with 
next  sample.  The  first  round  of  testing  was  performed 
to  create  MiTs  for  each  of  our  test  samples  and  to  as- 
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sure  Kaspersky  can  detect  the  malware  infection.  In 
every  case,  Kaspersky  detected  the  malware.  Know¬ 
ing  Kaspersky  can  detect  our  malware  sample’s  infec¬ 
tion  was  a  pre-requisite  to  the  second  round  of  testing 
where  Kaspersky  is  used  to  assess  the  effectiveness  of 
our  disinfection  attempt.  Round  two  of  testing  was  as 
follows:  1. Initially,  the  clean  state  snapshot  is  loaded 

2.  A  malware  sample  is  copied  to  the  Windows  desktop 

3.  The  sample  is  executed  for  3  minutes  4.  The  system 
is  manually  disinfected  using  the  samples’s  MiT  from 
round  one  5. The  snapshot  is  scanned  with  Kaspersky 
&  the  results  saved  6. Return  to  step  1  with  next  mal¬ 
ware  sample.  The  main  purpose  of  round  two  was  to 
assess  how  effective  our  disinfection  attempt  was  using 
a  MiT.  In  each  case  the  files  and  processes  listed  in  the 
MiT  that  were  found  in  the  infected  system  were  re¬ 
moved.  Kaspersky  did  not  detect  any  malicious  objects 
in  the  second  round  of  testing  implying  our  MiTs  were 
effective  in  eradicating  infection  from  the  system.  In 
every  case,  the  details  in  the  MiT  sufficed  to  erradicatd 
the  files  and  processes  from  the  system. 

Limitations.  Our  testing  was  conducted  in  a  vir¬ 
tual  machine  which  forcibly  excluded  using  vm-aware 
malware.  MiTCoN  is  limited  by  the  number  of  imple¬ 
mentations  in  which  a  rule  can  be  traced  in  a  specific 
OS.  We  are  continuously  adding  new  OS  specific  im¬ 
plementations  of  a  rule  into  MitCoN  as  well  as  discov¬ 
ering  new  rules  reflecting  malware  infection  strategies. 
Secure  testing  on  actual  machines  is  also  being  crafted 
for  future  use. 

5  Conclusion  &  Future  Work 

We  have  presented  an  abstract  approach  to  building 
comprehensive  MiTs  based  on  rules  describing  execu¬ 
tion  events  essential  to  malware  infection  strategies  of 
files  and  processes.  We  developed  a  MiT  creation  tool, 
named  MiTCoN,  for  the  Windows  Vista  platform  and 
tested  with  5800  known  malware  samples.  Our  analy¬ 
sis  revealed  MiTCoN  was  very  efficient  in  MiT  build¬ 
ing  and  our  rules  were  highly  effective  in  identifying 
the  relevant  malware  infection  files  and  processes  with 
high  occurrence  rate  in  all  samples  completing  in  un¬ 
der  200  milliseconds.  In  contrast  to  similar  systems, 
our  MiTs  avoid  false  positives  by  correctly  excluding 
non-malware  related  processes  and  files.  The  MiTs 
produced  by  our  tool  during  analysis  produced  100% 
successful  manual  system  disinfection  verified  with  a 
post-disinfection  malware  scan.  Our  results  suggests 
our  abstract  approach  using  a  malware  infection  strat¬ 
egy  basis  for  rule  creation  produces  MiTs  which  are 
highly  effective,  improve  analysis  and  disinfection,  and 
produce  minimal  false  positives.  Our  future  work  in¬ 


cludes  adding  new  execution  event  rules,  secure  testing 
in  a  real  machine  and  the  ability  to  create  MiTs  for 
malware  samples  that  are  not  WIN32  executables. 
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PI:  Kyoung-Don  Kang 

In  this  project,  we  have  investigated  research  issues  related  to  networking  monitoring  and  stream 
data  analysis  that  is  a  basis  for  the  security  and  maintenance  of  the  cloud  computing 
infrastructure.  Data  stream  management  systems  (DSMS)  are  developed  to  efficiently  process 
data  streams  in  important  applications,  e.g.,  network  monitoring,  financial  market  analysis,  and 
transportation  management.  As  data  streams  may  be  unbounded  and  arrive  at  high  rates,  most 
DSMS  support  continuous  reevaluation  of  queries  over  sliding  (or  jumping)  windows  of  stream 
data  tuples,  e.g.,  network  monitoring  data  or  stock  price  data  streams.  Using  a  declarative 
language  provided  by  a  DSMS,  a  user  can  easily  write  a  DSMS  application,  which  is  interpreted 
as  a  graph  of  the  interconnected  continuous  query  (CQ)  operators  necessary  to  process  data 
streams  for  the  application. 

Ideally,  DSMS  are  desired  to  process  data  streams  with  a  minimal  delay  to  monitor  critical  real 
world  phenomena,  e.g.,  market  status  changes  or  distributed  denial  of  service  (DDoS)  attacks. 
Ironically,  supporting  this  property  is  most  challenging  when  it  is  most  needed.  For  example,  a 
large  amount  of  data  streams  may  arrive  in  a  short  time  interval  when  the  status  of  a  financial 
market  or  communication  network  is  widely  fluctuating.  Thus,  the  DSMS  can  be  overloaded, 
resulting  in  late  detection  of  DDoS  attacks  or  losses  of  business  opportunities.  To  deal  with 
potentially  bursty  arrivals  of  stream  data,  a  CQ  operator,  e.g.,  a  windowed  select,  project,  join,  or 
aggregation  operator,  usually  buffer  its  input  data  streams  in  memory.  Under  overload,  however, 
the  input  stream  queues  for  buffering  may  grow  rapidly.  As  a  result,  the  system  may  suffer  from 
excessive  data  losses  or  CQ  processing  delays.  It  may  even  become  unreliable  due  to  potential 
thrashing  caused  by  severe  memory  contention. 

To  address  the  problem,  we  have  designed  a  novel  load  shedding  scheme  to  systematically 
control  the  backlog  accumulated  by  each  stream  even  in  the  presence  of  dynamic  workloads.  For 
load  shedding,  a  DSMS  should  determine  1)  when  to  shed  load,  2)  where  to  shed  load,  3)  which 
stream  data  tuples  to  drop,  and  4)  how  much  load  to  shed  [7].  Concerning  1)  and  2),  we  take  an 
immediate,  localized  approach  to  load  shedding.  Our  load  shedding  scheme  runs  immediately 
when  the  amount  of  backlog  of  any  individual  data  stream  changes  by  more  than  a  specified 
threshold.  It  probabilistically  drops  tuples,  if  necessary,  to  control  per-stream  backlog  to  be 
below  the  desired  bound  without  requiring  any  global  coordination,  which  is  computationally 
expensive  and  time  consuming  in  distributed  DSMS.  By  doing  this,  our  approach  avoids  a 
potential  avalanche,  in  which  the  backlog  accumulated  in  a  stream  queue  may  largely  delay  the 
succeeding  CQ  operators  in  a  cascading  manner.  Further,  the  impact  of  tuple  dropping  is  evenly 
spread  across  the  query  operators  cooperating  for  a  specific  application.  If  load  shedding  is 
applied  only  to  the  data  stream  sources,  too  many  queries  can  be  affected,  since  one  stream  may 


fan  out  to  multiple  streams  [7].  On  the  other  hand,  only  performing  load  shedding  at  the  sinks  of 
the  query  graph,  which  produce  the  final  CQ  processing  results,  is  subject  to  excessive  resource 
waste  caused  by  a  large  amount  of  intermediate  data  streams  that  will  be  eventually  dropped. 

Regarding  3)  and  4),  we  drop  tuples  randomly  according  to  the  drop  probability,  because  random 
tuple  dropping  incurs  a  minimal  decrease  of  the  accuracy  of  query  processing  results  without 
making  any  statistical  assumptions  about  data  distributions.  To  determine  how  much  load  to  shed, 
we  apply  fuzzy  control  techniques  [9]  to  derive  the  drop  probability  considering  the  current 
DSMS  status.  Fuzzy  control  is  based  on  fuzzy  logic  developed  to  deal  with  uncertainties 
prevalent  in  real  world  applications.  Due  to  its  robustness,  it  is  widely  applied  to  support 
performance  and  safety  features  in  many  important  systems  including  automobiles  and  consumer 
electronics.  It  can  naturally  express  fuzzy  (uncertain)  concepts  that  are  not  completely  true  or 
false  but  partially  true.  Thus,  it  is  very  effective  to  handle  uncertainties  in  highly-dynamic, 
nonlinear  systems.  In  this  project,  we  apply  it  to  efficiently  handle  uncertainties  in  DSMS 
including  bursty  data  arrivals  and  aperiodic,  data-dependent  CQ  executions.  We  have 
implemented  our  approach  and  compared  its  performance  to  that  of  Borealis — one  of  the  most 
advanced  open-source  DSMS  using  CAIDA  data  comprised  of  real-world  network  traffic  traces 
tailored  for  research.  Our  approach  showed  significant  performance  improvement  in  terms  of 
the  delay  and  data  stream  backlog. 

In  addition  to  the  DSMS  research,  we  have  designed  and  developed  a  GPU-accelerated 
MapReduce  framework  for  high  performance  big  data  analysis.  For  more  details  of  our  work  on 
data  stream  management  systems  and  a  GPU- accelerated  MapReduce,  please  refer  to  the 
attached  papers  [1,2]  that  fully  discuss  the  results  of  the  work.  These  research  results  are 
accepted  to  the  IEEE  International  Conference  on  Sendee  Oriented  Computing  &  Applications 
(SOCA  '12)  and  submitted  to  the  Journal  of  Parallel  and  Distributed  Computing. 
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Abstract — Data  stream  management  systems  (DSMS)  aim  to 
process  massive  data  streams  in  a  timely  fashion  to  support 
important  applications,  e.g.,  financial  market  analysis.  However, 
DSMS  can  be  overloaded  due  to  large  bursts  in  data  stream 
arrivals  and  data-dependent  query  executions.  To  avoid  over¬ 
loads,  we  design  a  new  load  shedding  scheme  by  applying 
distributed  fuzzy  logic  control,  which  is  very  effective  to  deal 
with  uncertainties  in  highly  dynamic  systems  such  as  DSMS, 
based  on  the  per-stream  backlog  and  selectivity  of  each  query 
operator.  We  have  implemented  our  approach  by  extending 
an  open  source  distributed  DSMS.  The  performance  evaluation 
using  high-rate  Internet  traces  shows  that  our  approach  closely 
supports  a  specified  backlog  bound  for  each  data  stream  queue, 
while  improving  the  query  processing  delay,  with  little  overhead. 

I.  Introduction 

Data  stream  management  systems  (DSMS),  including  [1], 
[2],  [3],  [4],  [5],  [6],  are  developed  to  efficiently  process  data 
streams  in  important  applications,  e.g.,  network  monitoring, 
financial  market  analysis,  and  transportation  management.  As 
data  streams  may  be  unbounded  and  arrive  at  high  rates,  most 
DSMS  support  continuous  reevaluation  of  queries  over  sliding 
(or  jumping)  windows  of  stream  data  tuples,  e.g.,  stock  price  or 
network  monitoring  data  streams.  Using  a  declarative  language 
(e.g.,  a  windowed  SQL-like  language  [5]  or  graphical  query 
processing  boxes  [4])  provided  by  a  DSMS,  a  user  can  easily 
write  a  DSMS  application,  which  is  interpreted  as  a  graph  of 
the  interconnected  continuous  query  (CQ)  operators  necessary 
to  process  data  streams  for  the  application. 

Ideally,  DSMS  are  desired  to  process  data  streams  with 
a  minimal  delay  to  monitor  critical  real  world  phenomena, 
e.g.,  market  status  changes  or  denial  of  service  (DoS)  attacks. 
Ironically,  supporting  this  property  is  most  challenging  when 
it  is  most  needed.  For  example,  a  large  amount  of  data  streams 
may  arrive  in  a  short  time  interval  when  the  status  of  a  financial 
market  or  communication  network  is  widely  fluctuating.  Thus, 
the  DSMS  can  be  overloaded,  resulting  in  late  detection  of 
DoS  attacks  or  losses  of  business  opportunities.  To  deal  with 
potentially  bursty  arrivals  of  stream  data,  a  CQ  operator,  e.g.,  a 
windowed  select,  project,  join,  or  aggregation  operator,  usually 
buffer  its  input  data  streams  in  memory.  Under  overload, 
however,  the  input  stream  queues  for  buffering  may  grow 
rapidly.  As  a  result,  the  system  may  suffer  from  excessive 
data  losses  or  CQ  processing  delays.  It  may  even  become 


unreliable  due  to  potential  thrashing  caused  by  severe  memory 
contention. 

To  address  the  problem,  we  design  a  novel  load  shedding 
scheme  to  systematically  control  the  backlog  accumulated  by 
each  stream  even  in  the  presence  of  dynamic  workloads.  For 
load  shedding,  a  DSMS  should  determine  1)  when  to  shed 
load,  2)  where  to  shed  load,  3)  which  stream  data  tuples 
to  drop,  and  4)  how  much  load  to  shed  [7].  Concerning  1) 
and  2),  we  take  an  immediate,  localized  approach  to  load 
shedding.  Our  load  shedding  scheme  runs  immediately  when 
the  amount  of  backlog  of  any  individual  data  stream  changes 
by  more  than  a  specified  threshold.  It  probabilistically  drops 
tuples,  if  necessary,  to  control  per-stream  backlog  to  be  below 
the  desired  bound  without  requiring  any  global  coordination, 
which  is  computationally  expensive  and  time  consuming  in 
distributed  DSMS.  By  doing  this,  our  approach  avoids  a 
potential  avalanche,  in  which  the  backlog  accumulated  in  a 
stream  queue  may  largely  delay  the  succeeding  CQ  operators 
in  a  cascading  manner.  Further,  the  impact  of  tuple  dropping 
is  evenly  spread  across  the  query  operators  cooperating  for  a 
specific  application.  If  load  shedding  is  applied  only  to  the  data 
stream  sources,  too  many  queries  can  be  affected,  since  one 
stream  may  fan  out  to  multiple  streams  [7],  On  the  other  hand, 
only  performing  load  shedding  at  the  sinks  of  the  query  graph, 
which  produce  the  final  CQ  processing  results,  is  subject  to 
excessive  resource  waste  caused  by  producing  a  large  amount 
of  intermediate  data  streams  that  will  be  eventually  dropped. 

Regarding  3)  and  4),  we  drop  tuples  randomly  according 
to  the  drop  probability,  because  random  tuple  dropping  incurs 
a  minimal  decrease  of  the  accuracy  of  query  processing 
results  without  making  any  statistical  assumptions  about  data 
distributions  [5],  [7].1  To  determine  how  much  load  to  shed, 
we  apply  fuzzy  control  techniques  [9]  to  derive  the  drop  prob¬ 
ability  considering  the  current  DSMS  status.  Fuzzy  control 
is  based  on  fuzzy  logic  developed  to  deal  with  uncertainties 
prevalent  in  real  world  applications.  Due  to  its  robustness,  it  is 
widely  applied  to  support  performance  and  safety  features  in 
many  important  systems  including  automobiles  and  consumer 
electronics.  Since  it  can  naturally  express  fuzzy  (uncertain) 

1  Our  approach  is  not  limited  to  random  dropping.  If  the  utilities  of  the  data 
streams  are  known  a  priori  similar  to  [8],  our  approach  can  be  easily  adapted 
to  drop  the  tuple  with  the  smallest  utility  first. 
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concepts  that  are  not  completely  true  or  false  but  partially  true, 
it  is  very  effective  to  handle  uncertainties  in  highly-dynamic, 
nonlinear  systems.  In  this  paper,  we  apply  it  to  efficiently 
handle  uncertainties  in  DSMS  including  bursty  data  arrivals 
and  aperiodic,  data-dependent  CQ  executions. 

Fuzzy  control  is  performed  using  if-precedent-then- 
consequent  rules  derived  based  on  the  logical  understanding  of 
the  fundamental  characteristics  of  the  controlled  system,  e.g., 
a  DSMS.  We  design  a  set  of  novel  fuzzy  rules  to  dynamically 
adjust  the  per-stream  drop  probability,  if  necessary,  to  control 
the  length  of  each  stream  queue  to  be  below  the  specified 
set-point  based  on  the  queue  length  error,  i.e.,  the  difference 
between  the  queue  length  set-point  and  current  length  of 
any  individual  data  stream  queue  in  a  DSMS,  even  in  the 
presence  of  dynamic  workloads.  Moreover,  we  design  another 
set  of  new  fuzzy  rules  to  consider  not  only  the  queue  length 
error  but  also  the  dynamics  of  query  operators  to  drop  just 
enough  tuples,  if  necessary,  to  avoid  overloads.  Specifically, 
we  consider  the  selectivity  of  each  individual  query  operator, 
i.e.,  the  ratio  of  the  number  of  the  output  stream  tuples 
produced  by  an  operator  to  the  number  of  the  input  stream 
tuples  entering  the  operator,  to  further  adjust  the  shedding 
probability.  We  seamlessly  integrate  the  per-stream  queue 
length  and  selectivity  controllers  into  a  single  load  shedding 
framework  to  avoid  both  overloads  and  excessive  data  losses  in 
DSMS.  For  example,  even  when  a  data  stream  queue  is  longer 
than  the  desired  bound,  our  approach  does  not  increase  the 
shedding  probability  for  the  specific  stream,  if  a  large  fraction 
of  the  stream  is  filtered  out  by  a  preceding  query  operator 
before  the  stream  reaches  the  queue  being  controlled  (or  vice 
versa).2 

In  addition,  our  fuzzy  logic  scheme  for  load  shedding  is 
executed  in  an  event-driven  manner;  that  is,  it  is  executed 
when  the  length  of  any  queue,  i.e.,  per-stream  backlog, 
changes  by  more  than  the  specified  event  threshold.  Being 
event-driven,  our  load  shedding  scheme  automatically  executes 
more  often  when  the  per-stream  backlog  increases  or  vice 
versa  unlike  the  other  control  theoretic  techniques  using  a  fixed 
control  period  [10],  [11],  [12].  Further,  our  approach  requires 
neither  centralized  control  nor  global  information  about  the 
system  status.  For  each  stream  queue,  the  shedding  probability 
is  adjusted  considering  the  per-stream  queue  size  error  and 
selectivity  of  the  preceding  CQ  operator.  In  summary,  our 
approach  supports  immediate,  localized  per-stream  backlog 
control  in  an  event-driven  manner  to  control  the  size  of  each 
data  stream  buffer  to  be  below  a  specified  set-point  even  in 
the  presence  of  uncertainties.  Supporting  these  features  is  not 
directly  considered  by  previous  work  on  DSMS  including  [5], 
[4],  [6],  [13],  [14],  [15],  [8],  [16], 

For  performance  evaluation,  we  have  implemented  our 
approach  by  extending  one  of  the  most  advanced  open  source 
DSMS,  Borealis  [4],  in  a  Linux  cluster.  For  performance 

“Note  that  network  congestion  control  is  not  directly  applicable  to  load 
shedding  in  DSMS  [7],  since  it  simply  drops  packets  as  the  buffer  in  a  network 
router  builds  up  without  simultaneously  considering  the  aforementioned  four 
issues  and  CQ  selectivity  important  in  DSMS. 
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Fig.  1.  A  Continuous  Query  Tree 

comparisons  between  our  approach  and  Borealis,  we  use  high¬ 
speed  Internet  traces  [17]  different  from  most  existing  research 
on  DSMS  (including  [5],  [4],  [6],  [18],  [19],  [13],  [14],  [15], 
[8],  [16]),  which  uses  synthetic  workloads  or  outdated  traces 
such  as  [20],  [21].  The  performance  evaluation  results  show 
that  our  approach  closely  supports  the  desired  per-stream 
queue  length  set-point.  The  per-stream  queue  length  of  our 
approach  is  up  to  two  orders  of  magnitude  shorter  than  the 
one  supported  by  Borealis  [16].  In  addition,  our  approach 
considerably  decreases  the  delay  for  query  processing. 

Notably,  our  approach  is  lightweight.  It  consumes  less  than 
1%  CPU  utilization  and  1KB  of  memory  to  store  the  if-then 
rules  for  systematic  load  shedding.  It  has  little  overhead,  since 
it  is  localized  and  requires  no  global  feedback  or  adaptation 
across  the  entire  query  graph  (also  called  a  query  network 
in  the  DSMS  literature).  Further,  it  only  requires  simple 
computations  and  look-ups  of  the  fuzzy  rules  stored  in  two 
tables  to  compute  the  tuple  drop  probability  for  load  shedding. 

The  rest  of  the  paper  is  organized  as  follows.  An  overview 
of  the  DSMS  architecture  and  load  shedding  in  DSMS  is 
given  in  Section  II.  A  description  of  our  load  shedding  scheme 
and  fuzzy  rule  design  is  given  in  Section  III.  Performance  of 
our  approach  is  evaluated  in  Section  IV.  Section  V  discusses 
related  research.  Finally,  Section  VI  concludes  the  paper  and 
discusses  future  work. 

II.  System  Architecture 

In  a  distributed  DSMS  application,  the  stream  sources 
provide  input  data  streams  to  the  rest  of  the  nodes  and  the 
final  output  is  produced  at  one  or  more  sink  nodes  as  shown  in 
Figure  1.  CQ  operators,  e.g.,  windowed  select,  project,  or  join, 
process  data  streams  and  produce  intermediate  data  streams. 
A  query  is  a  tree  of  CQ  operators.  One  or  more  queries  can 
be  linked  together  to  form  a  query  graph/network  as  shown 
in  Figure  1.  In  DSMS,  a  queue  is  usually  assigned  to  buffer 
data  arriving  from  an  individual  data  stream  [4],  [22],  [5].  The 
stream  queues  in  one  node  of  a  distributed  DSMS  usually  share 
a  single  memory  pool  in  the  node  without  capping  the  length 
of  an  individual  queue.  Thus,  a  sudden  arrival  rate  increase  in 
a  single  stream  can  cause  the  other  operators  to  starve  due  to 
the  insufficient  memory.  As  a  result,  the  overall  response  time 
may  increase  due  to  the  data  and  control  dependencies  in  the 
query  network. 


To  prevent  a  subset  of  streams  from  consuming  exces¬ 
sive  amounts  of  memory,  we  directly  control  the  backlog 
accumulated  in  each  stream  queue  to  be  below  a  specified 
threshold,  Qs,  even  in  the  presence  of  bursty  stream  data 
arrivals  and  aperiodic,  data-dependent  executions  of  query 
operators.3  Since  the  arrival  rate  and  selectivity  may  vary  from 
stream  to  stream  and  operator  to  operator,  our  approach  to  load 
shedding  performs  on  a  per-stream,  per-CQ  operator  basis. 


Fig.  2.  Adaptive  Load  Shedding 


Figure  2  shows  the  high  level  structure  of  our  adaptive  load 
shedding  scheme.  Let  Itj  represent  stream  j  flowing  into  CQ 
operator  i  in  a  distributed  DSMS.  In  Figure  2,  qrj  represents 
the  instantaneous  length  of  the  queue  used  to  buffer  /y  ’s  tuples 
and  Qij{k)  represent  the  size  of  the  queue  at  the  kth  control 
event  that  occurs  when: 

\qij(k)  -  qij(k  —  1)|  >  B  (1) 

where  k  >  1  and  B  is  the  control  event  threshold  used  to 
detect  a  noteworthy  increase  or  decrease  of  q,j .  (We  assume 
that  qij{ 0)  =  0  \/i,j  when  the  system  starts  initially.)  We 
take  an  event-driven  approach,  since  it  is  infeasible  to  find 
one  common  control  period  effective  to  monitor  and  adapt  all 
streams,  which  may  have  significantly  different  average  and 
transient  arrival  rates. 

To  control  the  length  of  each  stream  queue  to  be  below 
the  specified  set-point  Qs,  we  employ  two  interconnected 
controllers:  a  queue  length  control  module  (QCM)  and  a 
selectivity-based  control  module  (SCM)  in  Figure  2.  Via  fuzzy 
control,  QCM  computes  the  drop  probability  d!l:j  (k).  SCM 
computes  actually  used  to  probabilistically  drop  tuples 

of  I-ij ,  if  necessary,  to  support  Qs.  To  compute  d,:i  (k),  SCM 
adjusts  (k)  considering  the  selectivity  of  the  preceding  CQ 
operator,  which  pushes  stream  data  tuples  into  /,j’s  queue. 

Whenever  a  control  event  occurs  for  an  arbitrary  stream  in 
a  DSMS,  we  apply  our  load  shedding  method  to  the  specific 
stream.  Therefore,  in  the  rest  of  this  paper,  we  consider  only 
a  single  stream  Itj  to  present  our  per-stream  load  shedding 
scheme.  We  omit  the  operator  and  stream  indexes  i  and  j  for 
brevity  without  loss  of  generality. 

3For  the  clarity  of  presentation,  we  use  a  system-wide  queue  length  bound 
Qs  that  applies  to  every  data  stream  queue  in  a  DSMS.  However,  our 
approach  is  not  limited  to  a  single  queue  length  bound.  Different  streams 
may  have  different  queue  length  bounds. 


At  the  kth  (control)  event,  i.e.,  when  \q(k)  —  q(k  —  1) |  >  B, 
QCM  computes  the  error: 

e{k)  =  ( Qs  -  q(k))/Qs.  (2) 

At  the  kth  event,  QCM  in  Figure  2  measures  the  time 
between  two  consecutive  events: 

At(k)  =  t(k)  —  t(k  —  1)  (3) 

where  t(k)  indicates  the  time  at  which  the  kth  event  occurs. 

QCM  also  computes  the  queue  length  change  between  the 
consecutive  events: 


A q{k)  =  q{k)  -  q(k  -  1).  (4) 


Using  Eq  3  and  Eq  4,  QCM  calculates  the  trend  of  the 
queue  length  change  representing  the  speed  and  direction  of 
the  queue  length  change: 


/(*)  = 


A t(k)  if  A q(k)  >  0; 

—At(k)  if  A q(k)  <  0 


(5) 


Note  that  A q{k)  ^  0  when  f(k)  is  computed,  since  a  control 
event  occurs  only  when  the  queue  length  changes  by  more  than 
B  as  defined  in  Eq  1.  By  considering  the  trend  in  addition  to 
the  error,  we  intend  to  improve  the  effectiveness  of  per-stream 
backlog  management  in  DSMS.  For  example,  even  when  the 
queue  length  is  longer  than  Qs,  it  is  unnecessary  to  increase 
the  tuple  drop  probability  by  a  large  amount,  if  the  queue 
length  is  decreasing  fast  or  vice  versa.  Logical  observations 
such  as  these  ones  can  be  easily  formulated  via  fuzzy  rules. 
Based  on  e(k)  and  / (k).  QCM  computes  the  drop  probability 
d! (k)  using  the  fuzzy  rules  (described  in  Section  III). 

As  shown  in  Figure  2,  the  output  of  QCM,  d!{k),  is  an  input 
to  SCM  that  adapts  d'{k),  if  necessary,  to  further  enhance  the 
effectiveness  of  load  shedding  by  considering  the  selectivity. 
At  the  kth  event,  SCM  in  Figure  2  computes  the  selectivity 
change  of  the  preceding  operator  as  follows: 


A  s(k)  =  s(k)  —  s(k  —  1)  (6) 

In  this  equation,  s(k)  is  the  selectivity  of  the  preceding  oper¬ 
ator  computed  at  the  kth  event.  Thus,  0  <  s(k)  <  1.  Based 
on  A s(k),  SCM  modifies  d'(k),  if  necessary,  to  expedite  the 
convergence  of  the  queue  length  to  Qs  based  on  the  selectivity 
information.  For  example,  A s(k)  <  0,  if  the  selectivity  of  the 
operator  pushing  a  data  stream  into  the  queue  is  decreasing. 
In  this  case,  SCM  adjusts  the  drop  probability  to  shed  fewer 
incoming  stream  tuples  by  making  d(k)  <  d'(k)  (or  vice 
versa).  In  this  way,  it  intends  to  avoid  unnecessary  data  losses 
by  dropping  just  enough  tuples  needed  to  closely  support  Qs. 


III.  Application  of  Fuzzy  Control  to  Manage 
Per-Stream  Backlog 

Figure  3  shows  the  structure  of  our  closed-loop  load  shed¬ 
ding  scheme  consisting  of  QCM  and  SCM.  In  fuzzy  control, 
crisp  variables  expressed  as  real  numbers,  e.g.,  e(k)  and 
f(k)  in  Figure  3,  are  mapped  to  linguistic  variables,  e.g., 
the  error  and  trend,  to  naturally  express  human  thoughts  via 
the  process  called  fuzzification.  Using  the  linguistic  variables, 


Fig.  3.  Load  Shedding  via  Fuzzy  Logic  Control 


the  inference  scheme  looks  up  the  relevant  if-precedent-then- 
consequent  rules.  Finally,  the  defuzzification  interface,  such 
as  the  one  in  Figure  3,  derives  a  crisp  control  signal  from  the 
linguistic  rules.4 

In  the  rest  of  this  section,  we  describe  the  key  observations 
that  motivated  the  design  of  our  fuzzy  if-then  rules  for  effective 
per-stream  load  shedding.  Also,  we  describe  the  detailed 
procedure  for  fuzzy  control,  while  giving  illustrative  examples. 

A.  Design  of  Fuzzy  Rules  for  Adaptive  Load  Shedding 


Fig.  4.  Queue  Length  Variations 
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Fig.  5.  Rule-base  of  QCM  used  to  derive  d!  (/.:)  from  e(k)  and  / ( k ) 
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Fig.  6.  Rule-base  of  SCM  used  to  derive  d,{k )  from  A s(k)  and  d! (k) 

For  effective  fuzzy  control,  it  is  essential  to  design  an 
effective  rule-base.  They  are  designed  based  on  the  logical 
understanding  of  the  characteristics  of  the  controlled  system 
such  as  a  DSMS.  Figures  5  and  6  are  the  rule-bases  used 

4The  fuzzification/defuzzification  interface,  inference  scheme,  and  rule-base 
of  SCM  are  not  shown  in  Figure  3  to  avoid  a  graphical  repetition. 


by  QCM  and  SCM,  respectively.  The  rules  are  designed  to 
support  Qs  even  when  the  system  status  and  workloads  change 
dynamically.  For  example,  the  rule  {NS,  FI}  — >  DM  in 
Figure  5  requires  the  load  shedder  to  drop  most  (DM)  tuples  of 
the  stream  (fj),  if  the  current  error,  i.e.,  the  linguistic  version 
of  e(k),  is  negative  small  (NS),  i.e.,  e(k)  >  Qs  by  a  small 
amount.  However,  the  trend,  i.e.,  the  linguistic  version  of  f(k), 
is  fast  increasing  (FI),  i.e.,  f{k)  3>  0. 

We  design  the  rules  in  Figure  5  based  on  our  observations 
depicted  in  Figure  4.  In  the  figure,  there  are  four  major  states 
that  characterize  dynamic  queue  length  changes.  In  normal 
conditions,  it  is  desired  that  the  system  is  in  states  1  or  4  in 
Figure  4.  Queue  length  overshoots  are  observed  in  states  2 
and  3;  that  is,  the  queue  length  is  longer  than  Qs.  Also,  state 
2  is  worse  than  state  3,  since  the  queue  length  is  diverging 
from  Qs  in  state  2,  whereas  it  is  converging  to  Qs  in  state 
3.  The  objective  of  QCM  is  to  avoid  overshoots  by  adapting 
the  drop  probability  based  on  the  error  and  trend,  which  are 
the  linguistic  (i.e.,  fuzzified)  versions  of  Eq  2  and  Eq  5.  More 
specifically,  we  design  the  rule-base  in  Figure  5  for  QCM 
considering  the  four  states  in  Figure  4  as  follows. 

State  1.  e(k)  >  0  and  f(k)  >  0:  In  this  state,  the  queue  is  not 
longer  than  Qs,  but  it  is  increasing.  Thus,  the  linguistic  error 
is  zero  or  positive,  while  the  trend  is  increasing. 

•  Case  1:  The  error  is  zero.  In  this  case,  the  queue  length 
is  equal  to  Qs.  If  the  trend  indicates  a  fast  increase  (FI) 
or  medium  (MI)  of  the  error,  we  drop  some  (DS)  tuples 
as  shown  in  in  Figure  5.  On  the  other  hand,  we  drop  few 
(DF)  tuples,  if  the  trend  is  a  small  increase  (SI). 

•  Case  2:  The  error  is  positive.  In  this  case,  the  queue  is 
shorter  than  Qs.  As  this  is  a  desirable  state,  we  either 
drop  few  (DF)  tuples  or  drop  none  (DN).  If  the  error  is 
positive  small  (PS),  i.e.,  the  queue  is  shorter  than  Qs  by 
a  small  amount,  and  the  trend  is  FI  or  MI,  we  drop  few 
(DF)  tuples.  Otherwise,  the  rules  recommend  the  DSMS 
to  drop  none  (DN). 

State  2.  e(fc)  <  0  and  f{k)  >  0:  In  this  zone,  the  queue  is 
longer  than  Qs  and  further  increasing.  To  cancel  the  overshoot 
and  divergence  from  Qs,  QCM  recommends  the  DSMS  to  drop 
the  largest  fraction  of  tuples  in  this  state.  It  derives  the  tuple 
drop  probability  considering  the  magnitudes  of  the  error  and 
trend.  For  example,  when  the  error  is  negative  medium  (NM), 
the  queue  is  longer  than  Qs  by  a  medium  amount.  In  this 
case,  the  rules  {NM,  SI}  —>  DS  and  {NM,  MI}  — >  DM 
in  Figure  5  require  the  DSMS  to  drop  some  (DS)  and  drop 
most  (DM)  tuples  for  the  slow  increase  and  medium  increase 
of  the  queue  length,  respectively. 

State  3.  e(k)  <  0  and  / ( k )  <  0:  In  this  zone,  the  queue 
length  is  longer  than  or  equal  to  Qs,  but  it  is  decreasing. 
Control  rules  should  be  carefully  designed  by  comparing  the 
magnitudes  of  the  error  and  trend,  since  it  is  possible  for  them 
to  cancel  each  other.  For  example,  if  the  error  is  negative  small 
(NS)  and  the  queue  length  shows  a  slowly  decreasing  (SD) 
trend,  no  tuple  will  be  dropped  as  shown  in  Figure  5.  In  this 
region,  a  higher  drop  probability  is  derived  for  a  larger  absolute 


value  of  the  error  and  a  slower  decrease  of  the  queue  length 
or  vice  versa. 

State  4.  e(fc)  >  0  and  f[k)  <  0:  In  this  state,  the  actual  queue 
length  is  shorter  than  the  set-point  and  it  is  further  decreasing. 
Since  this  is  a  desirable  state,  the  QCM  requires  the  DSMS 
to  drop  none  of  the  tuples. 

SCM  aims  to  further  decrease  the  possibility  of  a  queue 
length  overshoot.  To  achieve  the  objective,  SCM  uses  the  rule- 
table  in  Figure  6  to  derive  d(k)  based  on  A s(k)  and  d! (k) 
computed  by  QCM.  More  specifically,  SCM  amplifies  d'  (k) 
only  when  the  selectivity  is  increasing,  i.e.,  A s(k)  =  s(k)  — 
s(k  —  1)  >  0.  In  this  case,  SCM  amplifies  d'(k)  based  on  the 
magnitude  of  A s(k)  and  d'(k).  Thus,  the  rule-base  in  Figure  6 
only  considers  positive  selectivity  changes. 

For  example,  suppose  that  the  selectivity  change  is  positive 
large  (PL),  but  QCM  produces  a  drop  none  (DN)  signal.  In 
this  case,  the  rule  {PL,DN}  — >  DS  in  Figure  6  modifies 
the  DN  signal  produced  by  the  QCM  to  the  drop  some  (DS) 
signal,  because  the  preceding  operator’s  selectivity  is  increas¬ 
ing  largely  and  the  length  of  the  stream  queue  succeeding  the 
operator  is  likely  to  increase  as  a  result. 

Notably,  the  first  two  columns  in  Figure  6  require  the 
system  to  further  increase  the  drop  probability  considering  the 
selectivity  increase.  On  the  other  hand,  the  last  two  columns 
require  no  further  increase  of  the  drop  probability,  since  the 
probability  computed  by  QCM  is  already  high  enough  in 
these  cases.  In  this  way,  SCM  aims  to  avoid  overshoots  and 
excessive  data  losses  simultaneously. 

B.  Fuzzy  Control  Procedure 

In  this  section,  the  fuzzification,  inference,  and  defuzzifi¬ 
cation  process  for  QCM  is  described.  The  procedure  used  by 
SCM  is  similar  and  omitted  due  to  space  limitations. 

At  the  kth  control  event,  the  fuzzification  interface  of  QCM 
in  Figure  3  maps  e(k)  and  f{k)  to  the  linguistic  variables  of 
the  error  and  trend  with  the  corresponding  linguistic  values. 
Using  the  derived  linguistic  values,  the  inference  mechanism 
in  in  Figure  3  looks  up  the  rule-base  to  find  relevant  if-then 
rules.  From  the  relevant  rules,  the  defuzzification  interface  of 
QCM  derives  a  crisp  control  signal  d'(k ),  via  defuzzification, 
to  support  the  desired  per-stream  backlog  bound  Qs. 

For  fuzzy  control,  each  variable,  such  as  the  variables  in 
Figures  7(a)  —  7(c),  should  be  associated  with  the  universe  of 
discourse,  i.e.,  the  domain,  [9].  In  this  paper,  the  universe  of 
discourse  for  e(k)  and  A s(k)  is  [-0.5,  0.5],  while  the  universe 
of  discourse  for  f{k)  is  [-3.5,  3. 5]. 5  For  the  control  output, 
i.e.,  d'(k)  and  d(k),  we  use  the  universe  of  discourse  of  [0, 
0.95]  to  bound  the  range  of  the  drop  probability. 

In  fuzzy  set  theory  underlying  fuzzy  control  theory,  to 
deal  with  uncertainties,  a  crisp  value  is  mapped  to  one  or 
more  fuzzy  sets;  that  is,  partial  membership  is  allowed  unlike 
set  theory  that  only  supports  binary  membership  [9].  The 
degree  of  membership,  called  certainty,  is  computed  using 

5If  e(k)  or  A s(k)  is  smaller  than  -0.5  (larger  than  0.5),  it  belongs  to  NL 
(PL)  with  certainty  1.  Similarly,  if  f(k)  <  —3.5  (f(k)  >  3.5).  it  belongs  to 
SD  (SI)  with  certainty  1. 


membership  functions  (MFs).  In  this  paper,  e(k)  may  belong 
to  one  or  two  fuzzy  MFs  in  Figure  7(a),  because  we  design 
adjacent  MFs  to  overlap.  For  all  the  MFs  in  Figure  7(a), 
except  for  the  leftmost  and  rightmost  ones,  we  use  symmetric 
triangles  of  an  equal  base  and  50%  overlap  with  the  adjacent 
fuzzy  MFs.  In  Figure  7(a),  the  horizontal  axis  represents 
e(fc),  while  the  vertical  axis  indicates  the  certainty  of  the 
membership. 

Fuzzification  is  the  first  step  in  fuzzy  control.  It  maps  the 
input  crisp  variables  to  the  corresponding  linguistic  variables. 
In  this  paper,  e{k)  measured  at  the  kth  control  event  are 
fuzzified  to  the  linguistic  error  with  one  or  two  linguistic 
values  in  Figure  7(a):  Negative  Large,  Negative  Medium, 
Negative  Small,  ZEro,  Positive  Small,  Positive  Medium,  and 
Positive  Large.  If  e{k)  <  0,  it  is  mapped  to  one  or  more 
Negative  MFs  in  Figure  7(a)  or  vice  versa. 

Via  fuzzification,  we  also  map  f(k)  to  the  linguistic  trend 
with  one  or  two  of  the  linguistic  values  in  Figure  7(b):  Slow 
Decrease,  Medium  Decrease,  Fast  Decrease,  Slow  Increase, 
Medium  Increase,  and  Fast  Increase.  If  f(k)  <  0,  it  is  mapped 
to  one  or  two  Decrease  MFs  in  Figure  7(b)  or  vice  versa 
(Eq  5).  In  the  following  example,  QCM  fuzzifies  e{k)  and 

m. 

Example  1.  Fuzzification.  Let  e(k)  =  —0.15625  and 
f(k)  =  1.5.  In  this  case,  the  error,  i.e.,  the  linguistic  version 
of  e{k),  belongs  to  NM  and  NS  in  Figure  7(a)  with  certainty 
Mjvm(— 0.15625)  =  0.25  and  ^^(-0.15625)  =  0.75.  At 
the  same  time,  the  trend,  i.e.,  the  linguistic  version  of  f{k), 
belongs  to  MI  in  Figure  7(b)  with  certainty  1. 

Using  the  fuzzified  linguistic  error  and  trend,  the  inference 
mechanism  of  QCM  looks  up  the  rule-base  in  Figure  5  to  find 
the  related  if-then  rules.  The  linguistic  variables  are  used  to 
infer  the  if-then  rules.  Due  to  partial  membership  allowed  in 
fuzzy  control,  more  than  one  if-then  rules  can  be  triggered  as 
a  result  of  the  rule-base  look-up.  In  this  case,  the  certainty  of 
an  inferred  rule  is  set  to  the  minimum  of  the  certainty  values  of 
the  precedents,  because  the  consequent  cannot  be  more  certain 
than  the  precedents  [9], 

Example  2.  Rule-Base  Lookup.  In  Example  1,  the  error 
belongs  to  NS  and  NM  with  the  certainty  0.75  and  1.  Further, 
the  trend  belongs  to  MI  with  the  certainty  1.  In  this  case,  two 
rules  in  Figure  5,  i.e.,  {NS,  MI}  — >  DS  and  {NM,  MI}  — > 
DM,  are  relevant.  Since  the  error  and  trend  belong  to  NS 
and  MI  with  the  certainty  0.75  and  1,  the  certainty  of  the 
first  rule  {NS,  MI}  — >  DS  is:  min{0.75,l}  =  0.75.  At  the 
same  time,  the  error  and  trend  belong  to  NM  and  MI  with  the 
certainty  0.25  and  1 ;  therefore,  the  certainty  of  the  second  rule 
{NM,MI}  ->  DM  is:  min{0.25,l}  =  0.25. 

Finally,  the  defuzzification  interface  derives  crisp  d'  (k)  from 
the  rules.  For  defuzzification,  QCM  and  SCM  use  the  MFs  in 
Figure  7(c):  Drop  None,  Drop  Few,  Drop  Some,  and  Drop 
Most.  To  describe  the  defuzzification  process,  let  x  and  y 
represent  the  row  and  column  indexes  in  Figure  5  (1  <  x  <  7 
and  1  <  y  <  6).  Also,  let  p(x,  y)  denote  the  certainty  of 
the  corresponding  rule(x,  y )  in  the  table.  Further,  let  c(x,y ) 
denote  the  center  of  the  fuzzy  set  that  is  the  rule(x,y)’ s 


(a)  MFs  used  to  fuzzify  e(k)  or  A s(k)  (b)  MFs  used  to  fuzzify  f(k)  (c)  MFs  used  to  derive  d! (k) 

and  dik) 


Fig.  7.  Membership  Functions  (MFs)  for  Fuzzification  and  Defuzzification 


consequent.  For  a  triangular  fuzzy  set,  the  center  is  the  middle 
of  the  triangle’s  base.  In  this  paper,  we  use  the  centroid 
method  [9]  to  compute  the  drop  probability  for  per-stream 
queue  length  control: 


d'(k) 


-c(x,y) 


(7) 


Example  3.  Defuzzification.  In  Example  2,  two  control 
signals  DS  and  DM  are  derived  with  certainty  values  0.75  and 
0.25,  respectively.  According  to  Figure  7(c),  the  center  of  DS 
is  0.475  and  the  center  of  DM  is  0.712.  Using  the  certainty 
values  of  the  rules  from  the  previous  example,  we  compute  the 
drop  probability:  d'(k)  =  0'75'004^+°'^'°'712  =  0.534.  Thus, 
this  stream’s  tuples  will  be  randomly  dropped  with  probability 
0.534  until  the  next  control  event  happens,  unless  SCM  further 
adjusts  the  drop  probability  based  on  the  selectivity  change 
A  s(k). 


IV.  Performance  Evaluation 
In  this  section,  a  description  of  the  experimental  settings  is 
followed  by  the  performance  evaluation  results. 


Source 
Level  1 
Level  2 
Sink 

Fig.  8.  Query  network  topology  for  experiments 

For  our  experiments,  we  use  CAIDA’s  Internet  traffic  traces 
collected  by  the  equinix-sanjose  high-speed  monitor  [17]. 
The  average  bit-rate  of  the  monitored  network  is  around  2.3 
Gbps.  Approximately  500,000  packets  on  average  are  being 
processed  every  second  by  each  of  the  micro-benchmark  ap¬ 
plications  that  we  have  used  for  performance  evaluation.  Thus, 
our  applications  model  the  challenging  problem  of  monitoring 
bursty  and  highly  dynamic  Internet  traffic  patterns. 

Figure  8  shows  the  query  network  topology  used  in  our 
experiments.  The  input  data  streams  flow  from  the  sources  and 
processed  in  the  four  layers  to  generate  the  query  results  at  the 
sink.  The  15  physical  machines  in  the  figure  are  connected  via 
a  1  Gbps  switch.  Each  machine  has  a  1.5  GHz  or  2.2  GHz  Intel 


dual  core  CPU  and  1GB  main  memory.  In  our  experiments, 
the  stream  sources  use  the  faster  machines  to  generate  high- 
rate  data  streams.  Input  data  are  stored  in  raw  packet  format 
and  partitioned  to  eight  source  level  machines  based  on  their 
destination  addresses,  modeling  network  monitoring  sensors 
placed  at  different  locations.  Sources  read  blocks  of  the  traces, 
transform  the  data  into  Borealis  tuples,  and  inject  them  into 
the  input  streams  of  the  query  network.  The  8  query  chains 
in  Figure  8  are  executed  in  parallel.  For  each  query  operator, 
we  use  a  jumping  window  of  1000  tuples.  Thus,  an  individual 
stream  query  operator  is  executed  for  every  1000  tuple  arrivals. 

Table  I  shows  the  flow  statistics  (FS)  application.  In  FS, 
the  flow  information,  including  source  and  destination  ad¬ 
dresses  as  well  as  port  numbers,  is  extracted  from  the  TCP/IP 
packets  to  continuously  analyze  flow  statistics  for  the  source- 
destination  pairs  found  in  the  jumping  window.  (For  brevity, 
the  window  constraint  is  not  included  in  the  table.)  The 
continuous  queries  in  the  table  are  deployed  at  the  specified 
levels  in  Figure  8  to  analyze  network  traffic  patterns  based 
on  flow  IDs.  The  sink  node  computes  a  running  average  and 
reports  flows  whose  rates  are  higher  than  the  value  specified  in 
Table  I.  Although  we  have  done  experiments  for  other  network 
analysis  applications,  we  mainly  report  the  results  of  FS  due 
to  space  limitations. 

In  this  paper,  we  compare  the  performance  of  our  approach, 
called  FuzzyQS,  to  that  of  Borealis,  which  employs  FIT  [16] 
for  load  shedding,  mainly  in  terms  of  the  queue  length  and  CQ 
processing  delay.  At  every  second,  we  record  the  maximum 
queue  length  measured  in  the  query  network.  In  this  paper,  we 
set  Qs  =  1000  tuples  for  performance  evaluation  purposes. 
For  performance  evaluation,  we  keep  track  of  the  biggest 
queue  size  in  the  whole  distributed  DSMS  at  every  second. 
In  addition,  the  largest  source-to-sink  query  processing  delay 
is  measured  every  Is.  Since  our  approach  avoids  overloads  via 
load  shedding  on  a  per-stream  and  per-operator  basis,  they  can 
indirectly  reduce  the  delay  too. 

Figure  9  shows  the  cumulative  distribution  of  the  queue 
length  for  10  experimental  runs  of  the  FS  application.  As 
shown  in  the  figure,  the  95  percentile  queue  length  of  Borealis 
and  FuzzyQS  is  59435  and  857  tuples,  respectively.  Figure  10 
shows  the  transient  queue  length  measured  every  Is  for  1  run. 
FuzzyQS  controls  the  per-stream  queue  length  to  be  below 
Qs  =  1000  tuples  for  most  of  the  time  except  for  a  few 


Level 


DSMS  Application 


Flow  Summary  (FS) 


Source 


Level  1 
Level  2 
Sink 


Query 

select  *  from  (select  count (*) ,  sum (size) , flowid  from  (select  [flow  fields] 
from  input)  group  by  flowid)  where  count  >  5 

select  *  from  (select  sum(count) ,  sum(size) ,  flowid  from  (select  *  from 

input 1, input 2)  group  by  flowid)  where  count  >  10 

select  *  from  (select  sum(count) ,  sum(size) ,  flowid  from  (select  *  from 

input 1, input 2)  group  by  flowid)  where  count  >  20 

select  *  from  (select  sum(count) ,  sum(size) ,  flowid  from  (select  *  from 

input 1, input 2)  group  by  flowid)  where  size  >  avg(size)  /  2 


TABLE  I 

Flow  analysis  application  deployed  over  the  query  network  in  Figure  8 


Tuples 


Fig.  9.  Flow  Summary:  Cumulative  Queue  Length  for  10  Runs 
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Fig.  10.  Flow  Summary:  Transient  Queue  Length  for  1  Run 

overshoots  as  shown  in  Figure  10.  From  Figures  9  and  10, 
we  observe  that  the  per-stream  queue  length  of  FuzzyQS  is 
up  to  two  orders  of  magnitude  shorter  than  that  of  Borealis. 
By  controlling  the  per-stream  backlog,  FuzzyQS  considerably 
decreases  the  delay  too.  In  Figure  11,  the  95  percentile  delay 
of  Borealis  and  FuzzyQS  for  10  runs  is  8.86s  and  5.26s, 
respectively.  Thus,  the  95  percentile  delay  of  our  approach 
is  approximately  40%  shorter  than  that  of  Borealis.  FuzzyQS 
achieves  these  improvements  via  per-stream,  fully-distributed 
queue  length  control  based  on  fuzzy  rules.  Borealis  makes 
system-wide  decisions  on  load  shedding  at  the  nodes  near 
the  sinks.  The  decisions  are  aggregated  and  cascaded  back 
toward  the  sources  in  an  incremental  manner  [16].  Notably, 
existing  DSMS,  including  [1],  [2],  [3],  [5],  [4],  [6],  [18],  [19], 
[13],  [14],  [15],  [8],  [16],  do  not  aim  to  support  the  desired 
per-stream  queue  length  even  in  the  presence  of  dynamic 
workloads.  In  the  future,  we  will  investigate  how  to  further 
reduce  per-stream  queue  length  overshoots. 

Figure  12  shows  the  cumulative  output  rate  for  10  runs, 
which  measures  the  number  of  the  output  tuples  leaving  the 
sink  per  second.  From  the  figure,  we  observe  that  the  output 


Fig.  11.  Flow  Summary:  Cumulative  Delay  for  10  Runs 


rate  of  FuzzyQS  is  similar  to  the  rate  of  Borealis.  Thus,  our 
approach  does  not  simply  drop  more  tuples  than  Borealis  does 
to  control  the  queue  length.  Instead,  it  judiciously  adapts  the 
drop  rate  of  each  stream,  if  necessary,  to  avoid  forming  a 
bottleneck  in  the  query  chains. 

In  addition  to  FS,  we  have  designed  and  run  two  more 
applications  to  analyze  network  traffic  patterns  based  on 
the  source  and  destination  IP  address,  respectively.  In  these 
experiments,  FuzzyQS  has  closely  supported  Qs.  The  queue 
length  of  Borealis  is  up  to  an  order  of  magnitude  longer  that 
that  of  FuzzyQS.  The  response  time  of  Borealis  is  longer  than 
ours  by  0.63s  -  1.2s.  Also,  FuzzyQS  has  shown  simliar  output 
rates  to  those  provided  by  Borealis.  Detailed  results  for  these 
DSMS  applications  are  not  included  due  to  space  limitations. 

Moreover,  our  approach  incurs  little  overhead.  In  our  exper¬ 
iments,  our  load  shedding  scheme  is  activated  approximately 
80  times/s  on  average,  consuming  approximately  0.25%  total 
CPU  utilization.  It  consumes  less  than  1KB  of  memory  in  each 
node  to  store  the  two  rule-bases  and  keep  track  of  the  queue 
lengths  and  operator  selectivities. 


Fig.  12.  Flow  Summary:  Cumulative  Output  Rates  for  10  Runs 


V.  Related  Work 

Load  shedding  techniques  have  been  studied  to  enhance 
the  performance  and  reliability  of  DSMS  [13],  [14],  [8], 
[16].  Under  overload,  the  load  shedder  of  Aurora  drops  the 
tuple  with  the  lowest  utility  first  under  the  assumption  that 
the  utilities  of  streams  are  known  a  priori  [8].  In  Borealis, 
global  load  shedding  decisions  are  made  near  the  sink  and 
propagated  back  toward  the  stream  sources  [16].  Data  Triage 
does  load  shedding  under  overload,  while  generating  short 
summaries  of  the  discarded  packets  [13].  This  approach  is 
complementary  to  ours  in  that  it  could  be  combined  with 
our  approach  to  support  a  summary  of  dropped  tuples  in  the 
background.  Overall,  our  work  is  different  in  that  it  supports 
fully  localized,  per-stream  backlog  management  by  leveraging 
continuous  query  semantics,  while  applying  well-established 
fuzzy  control  techniques. 

A  proportional  and  integral  (PI)  controller  is  designed  to 
bound  the  average  per- tuple  delay  via  load  shedding  [14],  It 
has  also  been  used  to  support  real-time  periodic  queries  [18], 
[19],  [23]  implemented  in  STREAM  [5].  However,  a  PI 
controller  may  largely  fail  to  support  the  desired  performance, 
if  the  model  of  the  controlled  system,  e.g.,  a  DSMS,  derived 
offline  using  representative  workloads  becomes  inaccurate 
due  to  dynamic  workloads  and  system  behaviors  [10].  Un¬ 
fortunately,  finding  representative  workloads  for  a  highly- 
dynamic,  nonlinear  systems,  such  as  DSMS,  is  very  hard  (if 
at  all  possible).  Applying  other  branches  of  control  theoretic 
techniques  based  on  mathematical  modeling  of  the  controlled 
system,  such  as  [11],  [12],  also  suffers  from  the  difficulty  of 
accurately  modeling  highly-dynamic  and  nonlinear  DSMS. 

Fuzzy  logic  control  has  been  applied  to  manage  the  per¬ 
formance  of  other  computational  systems  such  as  real-time 
tracking  [24].  Basaran  et  al.  [25]  show  that  fuzzy  control 
considerably  outperforms  a  PI  controller  and  a  model  pre¬ 
dictive  controller  in  terms  of  CPU  utilization  control  in  a 
real-time  operating  system.  However,  these  approaches  do  not 
aim  to  support  load  shedding  considering  distributed  DSMS 
semantics  discussed  in  Section  I.  Nor  do  they  support  event- 
driven  control  to  manage  the  per-stream  backlog. 

VI.  Conclusion  and  Future  Work 

In  this  paper,  we  present  a  lightweight  load  shedding  scheme 
to  manage  the  per-stream  backlog.  Our  approach  directly 
considers  the  per-stream  backlog  and  selectivity  of  each  query 
operator  to  support  graceful,  fine-grained  load  shedding  in  a 
distributed  manner.  Further,  our  approach  is  event-driven.  It 
quickly  reacts  to  bursty  workloads,  while  avoiding  an  unnec¬ 
essary  activation  of  the  controller  under  light  load  conditions. 
The  performance  of  our  approach  is  evaluated  thoroughly 
using  high-rate  Internet  traces  by  extending  Borealis,  which 
is  an  advanced  open-source  distributed  DSMS.  From  our 
experiments,  we  observe  that  our  approach  can  closely  sup¬ 
port  the  desired  per-stream  queue  length,  while  considerably 
decreasing  the  per-stream  backlog  and  query  processing  delay 
compared  to  Borealis.  In  the  future,  we  will  explore  more 


effective  scheduling  and  load  shedding  schemes  to  further 

enhance  the  performance  of  distributed  DSMS. 
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Abstract 

In  this  paper,  we  present  a  new  MapReduce  framework,  called  Grex,  de¬ 
signed  to  leverage  general  purpose  graphics  processing  units  (GPUs)  for  par¬ 
allel  data  processing.  Grex  provides  several  new  features.  First,  it  supports  a 
parallel  split  method  to  tokenize  input  data  of  variable  sizes,  such  as  words  in 
e-books  or  URLs  in  web  documents,  in  parallel  using  GPU  threads.  Second, 
Grex  evenly  distributes  data  to  map/reduce  tasks  to  avoid  data  partitioning 
skews.  In  addition,  Grex  provides  a  new  memory  management  scheme  to  en¬ 
hance  the  performance  by  exploiting  the  GPU  memory  hierarchy.  Notably,  all 
these  capabilities  are  supported  via  careful  system  design  without  requiring  any 
locks  or  atomic  operations  for  thread  synchronization.  The  experimental  results 
show  that  our  system  is  up  to  12. 4x  and  4. lx  faster  than  two  state-of-the-art 
GPU-based  MapReduce  frameworks  for  the  tested  applications. 


1.  Introduction 

As  the  modern  GPUs  are  becoming  more  programmable  and  flexible,  their 
use  for  general  purpose  computation  is  increasing.  NVIDIA’s  CUDA  [1]  and 
ATI’s  OpenCL  [2]  support  general  purpose  GPU  programming.  Generally,  a 
GPU  has  a  significantly  larger  number  of  processing  elements  than  a  modern 
multicore  CPU  does,  providing  massive  hardware  parallelism.  For  example,  an 
NVIDIA  GeForce  8800  GPU  provides  128  cores  and  a  maximum  768  resident 
threads.  Also,  a  more  advanced  NVIDIA  Fermi  GPU  provides  512  cores  that  can 
accommodate  up  to  1536  threads.1  Especially,  GPUs  are  designed  to  support 
immense  data  parallelism;  that  is,  GPU  hardware  is  optimized  for  threads  to 
independently  process  different  chunks  of  data  in  parallel. 

MapReduce  [3]  is  a  popular  programming  model  designed  to  process  a  large 
amount  of  data  in  parallel.  A  user  developing  a  data  parallel  application  via 
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xIn  this  paper,  we  focus  on  NVIDIA  GPUs;  however,  the  principal  ideas  of  our  approach 
are  generally  applicable  to  the  GPUs  from  the  other  vendors.  Hereafter,  a  GPU  refers  to  an 
NVIDIA  GPU. 
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MapReduce  is  only  required  to  write  two  primitive  operations,  namely  map 
and  reduce  functions.  A  MapReduce  application  is  called  a  job.  Given  a  job, 
the  MapReduce  runtime  partitions  the  job  into  smaller  units  called  tasks.  It 
assigns  tasks  and  data  chunks  to  workers,  which  have  available  computational 
resources,  called  slots,  to  process  the  data  chunks  by  running  the  tasks.2  The 
workers  process  different  data  chunks  in  parallel  independently  from  each  other. 
The  workers  running  map  tasks  process  input  <key,  value>  pairs  to  generate 
intermediate  <key,  value>  pairs.  The  reduce  function  merges  all  intermediate 
<key,  value>  pairs  associated  with  the  keys  that  are  in  the  same  logical  group. 
As  the  runtime  handles  scheduling  and  fault  tolerance  in  a  transparent  manner, 
it  is  easy  for  a  user  to  write  data  parallel  applications  using  MapReduce. 

Given  that  both  GPUs  and  MapReduce  are  designed  to  support  vast  data 
parallelism,  it  is  natural  to  attempt  to  leverage  the  hardware  parallelism  pro¬ 
vided  by  GPUs  for  MapReduce.  Although  doing  this  may  seem  simple  on  the 
surface,  it  faces  a  number  of  challenges  especially  to  process  variable  size  data 
for  industrially  relevant  and  highly  data  parallel  applications,  such  as  natural 
language  processing  and  web  document  analysis: 

•  Most  GPUs  do  not  support  dynamic  memory  management.  Thus,  it  is 
challenging  to  process  variable  size  data  in  GPUs.  Furthermore,  the  max¬ 
imum  size  of  variable  length  data  is  often  unknown  a  priori  in  natural 
language  processing  and  web  document  analysis.  Hence,  existing  tech¬ 
niques  developed  to  deal  with  variable  size  data,  such  as  padding,  are  not 
directly  applicable.3 

•  While  processing  variable  size  data,  MapReduce  tasks  may  suffer  from  load 
imbalance.  To  process  variable  size  textual  data,  the  MapReduce  runtime 
sequentially  scans  the  entire  input  data  and  assigns  the  same  number  of 
lines  to  each  map  task  in  a  phase  called  input  split.  Unfortunately,  input 
split  is  not  only  slow  due  to  the  sequential  execution  but  also  subject  to 
potential  load  imbalance  among  the  map  tasks,  since  the  amount  of  data 
may  vary  from  line  to  line. 

•  In  the  reduce  phase  of  MapReduce,  an  entire  group  of  intermediate  data 
having  the  same  key  is  hashed  to  a  single  reduce  task,  incurring  severe  load 
imbalance  among  the  reduce  tasks  in  the  presence  of  data  skews  [5,  6].  For 
example,  certain  words  or  URLs  tend  to  appear  more  frequently  than  the 
others  in  natural  language  or  web  documents.  As  a  result,  a  reduce  task 
assigned  a  popular  key  group  is  likely  to  have  a  significantly  larger  number 
of  intermediate  <key,  value>  pairs  to  process  than  the  other  reducers  do. 
Load  imbalance  due  to  the  input  split  and  data  skews  may  adversely  affect 


2In  Hadoop  [4],  an  open  source  implementation  of  MapReduce  by  Yahoo,  a  slot  represents 
an  available  CPU  core  by  default. 

3  To  apply  padding,  the  entire  set  of  the  variable  size  data  has  to  be  preprocessed  to  find 
the  size  of  the  biggest  data  item  first.  After  that,  actual  padding  needs  to  be  performed. 
These  extra  steps  may  introduce  a  significant  overhead  when  the  data  set  is  large. 
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the  performance,  since  it  is  known  that  the  slowest  task  determines  the 
speed  of  a  job  that  consists  of  multiple  tasks  executed  in  parallel. 

•  GPU  memory  hierarchy  is  significantly  different  from  the  host  (i.e. ,  CPU 
side)  memory  hierarchy.  Since  the  number  of  the  cycles  consumed  by  the 
fastes  and  slowest  memory  in  the  hierarchy  is  different  by  several  orders  of 
magnitude,  it  is  challenging  but  important  to  efficiently  utilize  the  GPU 
memory  hierarchy. 

To  tackle  these  challenges,  we  design,  implement,  and  evaluate  a  new  GPU- 
based  MapReduce  framework  called  Grex  (meaning  a  herd  in  Greek) .  Our  goal 
is  to  efficiently  process  variable  size  data  for  the  entire  MapReduce  phases  be¬ 
ginning  from  input  split  to  reduction,  while  minimizing  potential  load  imbalance 
in  MapReduce.  At  the  same  time,  we  aim  to  efficiently  utilize  the  GPU  mem¬ 
ory  hierarchy,  which  is  important  for  general  MapReduce  applications  executed 
in  GPUs  to  process  either  variable  or  fixed  size  data.  A  summary  of  the  key 
contributions  made  by  Grex  towards  achieving  these  goals  follows: 

•  To  efficiently  handle  variable  size  data,  Grex  replaces  the  sequential  split 
phase  in  MapReduce  with  a  new  parallel  split  step.  In  parallel  split,  Grex 
evenly  distributes  input  data  in  terms  of  the  number  of  bytes  to  the  threads 
that  tokenize  input  data  in  parallel.4  Notably,  Grex  does  not  need  to  know 
the  size  of  any  token  in  advance  to  do  parallel  split. 

•  After  parallel  split,  each  thread  executes  the  user-specified  map()  function 
to  process  the  tokens  found  as  a  result  of  parallel  split.  After  finishing  to 
execute  the  map()  function,  each  thread  that  has  found  the  beginning  of  a 
token  computes  a  unique  address  to  which  an  intermediate  data  item  will 
be  written  without  communicating  with  the  other  threads.  Therefore,  no 
locks  or  atomics  for  synchronization  are  needed.  Similarly,  the  output  of 
the  reduce  phase  is  written  without  requiring  any  locks  or  atomic  opera¬ 
tions. 

•  Grex  assigns  an  equal  number  of  intermediate  <key,  value>  pairs  to  the 
reduce  tasks  to  execute  the  user-specified  reduce))  function  in  a  parallel, 
load-balanced  manner  even  if  the  keys  are  skewed. 

•  Although  MapReduce  deals  with  <key,  value>  pairs,  only  keys  are  needed 
for  a  considerable  part  of  data  processing  in  a  number  of  MapReduce 
applications.  In  word  count,  for  example,  keys  can  be  processed  by  map 
tasks  and  sorted  and  grouped  together  with  no  values.  After  grouping  the 
same  keys  together,  the  frequency  of  each  unique  key  has  to  be  computed. 


4In  this  paper,  a  meaningful  unit  of  variable  size  data,  e.g.,  an  English  word  or  URL,  is 
called  a  token.  Converting  raw  data,  e.g.,  natural  language  or  web  documents,  into  tokens  is 
called  tokenization.  For  fixed  size  input  data  such  as  integers  or  floats,  Grex  simply  assigns 
the  same  number  of  data  items  to  the  map  tasks  and  skips  the  parallel  split  step. 
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Thus,  it  is  not  necessary  to  generate  values  in  an  earlier  stage  in  certain 
applications.  Based  on  this  observation,  Grex  supports  a  new  feature 
called  lazy  emit.  Grex  provides  an  API  to  let  a  user  specify  the  MapReduce 
phase  of  a  specific  application  where  Grex  needs  to  create  values  and  emit 
complete  <key,  value>  pairs  to  the  GPU  DRAM,  called  global  memory. 
Before  the  user-specified  phase  begins,  Grex  does  not  generate  or  store  any 
values  but  only  deals  with  keys.  By  doing  this,  Grex  intends  to  decrease 
the  overhead  to  handle  intermediate  data  in  terms  of  computation  and 
memory  usage. 

•  Through  all  the  MapReduce  phases,  Grex  takes  advantage  of  the  GPU 
memory  hierarchy  to  decrease  the  delay  for  memory  access.  To  this  end, 
Grex  provides  an  API  by  which  a  user  (i.e.,  a  MapReduce  application 
developer)  can  specify  the  size  and  type  (read-only  or  read/write)  of  a 
buffer  that  will  be  stored  in  global  memory  and  cached  by  Grex  using 
faster  memory  of  the  GPU  in  a  transparent  manner. 

Although  a  number  of  GPU-based  MapReduce  systems  have  previously  been 
developed  [3,  4,  7,  8,  9,  10,  11,  12],  we  are  not  aware  of  any  other  GPU-based 
MapReduce  framework  that  supports  all  these  features. 

We  have  implemented  Grex  and  tested  our  implementation  on  a  PC  with 
an  AMD  quad-core  CPU  using  two  different  generations  of  NVIDIA  GPUs.  We 
have  compared  the  performance  of  Grex  to  that  of  Mars  [7]  and  MapCG  [8], 
which  are  open  source  GPU-based  MapReduce  frameworks.  For  performance 
evaluation,  we  have  used  six  common  applications  with  different  needs  in  terms 
of  MapReduce  computations,  i.e.,  string  match,  page  view  count,  word  count, 
inverted  index,  similarity  score,  and  matrix  multiplication.  The  first  four  appli¬ 
cations  deal  with  variable  size  data,  while  the  last  two  applications  process  fixed 
size  data.  By  considering  a  diverse  set  of  applications,  we  thoroughly  evaluate 
the  performance  of  Grex  in  comparisons  to  Mars  and  MapCG.  Our  performance 
evaluation  results  indicate  up  to  12. 4x  speedup  over  Mars  and  up  to  4. lx  speedup 
over  MapCG  using  the  GeForce  GTX  8800  GPU  and  GeForce  GTX  460  GPU 
that  supports  atomic  operations  needed  by  MapCG,  respectively. 

The  remainder  of  this  paper  is  organized  as  follows.  Related  work  is  dis¬ 
cussed  in  Section  2.  A  brief  description  of  MapReduce  and  the  general  GPU 
architecture  is  given  in  Section  3.  The  design  of  Grex  is  described  in  Section  4. 
Grex  code  snippets  are  discussed  in  Section  5  to  further  illustrate  the  key  con¬ 
cepts  introduced  in  this  paper.  The  performance  evaluation  results  are  discussed 
in  Section  6.  Finally,  Section  7  concludes  the  paper  and  discusses  future  work. 


2.  Related  Work 

MapReduce  is  developed  by  Google  [3] .  Hadoop  [4]  is  an  open-source  imple¬ 
mentation  of  MapReduce  developed  by  Yahoo.  Both  MapReduce  and  Hadoop 
are  designed  for  commodity  servers.  Phoenix  [9]  and  its  extensions  [10,  13] 
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are  one  of  the  first  projects  to  implement  a  MapReduce  framework  for  a  multi¬ 
core  architecture  with  shared  memory.  Ostrich  [14]  intends  to  improve  Phoenix 
by  implementing  an  iterative  map  phase.  Similarly,  Grex  iteratively  processes 
input  data  by  applying  the  tiling  technique  [15].  There  is  also  a  MapReduce 
implementation  for  the  IBM  Cell  architecture  [16]. 


Property 

Grex 

Mars 

MapCG 

[ii] 

Mithra 

GPMR 

Single  pass  execution 

yes 

no 

yes 

yes 

yes 

yes 

Locks  or  atomics 

no 

no 

yes 

yes 

no 

no 

Parallel  split 

yes 

no 

no 

no 

no 

no 

Load  balancing 

yes 

no 

no 

no 

no 

yes 

Lazy  emit 

yes 

no 

no 

no 

no 

no 

GPU  caching 

yes 

no 

no 

yes 

no 

no 

GPU  cluster 

no 

no 

no 

no 

yes 

yes 

Table  1:  Comparisons  of  the  GPU-based  MapReduce  frameworks 


Several  GPU-based  MapReduce  frameworks  have  been  developed  by  differ¬ 
ent  researchers.  A  GPU-based  MapReduce  framework  developed  by  Catanzaro 
et  al.  [17]  supports  a  constrained  MapReduce  model  that  assumes  that  the  keys 
are  known  in  advance  and  a  map  function  produces  only  one  output.  Due  to 
these  assumptions,  the  applicability  of  [17]  is  limited.  Especially,  it  is  not  di¬ 
rectly  applicable  to  process  raw  (i.e.,  not  tokenized)  variable  size  data.  Table  1 
summarizes  the  key  features  provided  by  several  GPU-based  MapReduce  frame¬ 
works  that  support  full  MapReduce  functionalities.  Mars  [7]  is  a  GPU-based 
MapReduce  framework.  It  shows  up  to  lOx  speedup  over  Phoenix  [9];  however, 
it  is  not  optimized  for  the  GPU  architecture.  To  overcome  the  lack  of  dynamic 
memory  management  in  many  GPUs,  Mars  computes  the  precise  amount  of 
GPU  memory  required  for  processing  a  MapReduce  job  in  the  first  pass.  It 
does  actual  data  modification  and  generation  in  the  second  pass.  Since  the  first 
pass  performs  every  MapReduce  task  except  for  producing  actual  output,  Mars 
essentially  executes  a  MapReduce  job  twice.  Neither  does  it  utilize  the  GPU 
memory  hierarchy. 

MapCG  [8]  uses  atomic  operations  to  support  dynamic  memory  management 
in  global  memory.  In  this  way,  MapCG  avoids  two  pass  execution  of  Mars. 
However,  atomic  operations  require  sequential  processing  of  dynamic  memory 
allocation  or  deallocation  requests,  incurring  performance  penalties.  Ji  and  Ma 
[11]  enhance  the  performance  of  MapReduce  applications  using  shared  memory 
in  a  GPU.  However,  in  their  approach,  half  the  threads  are  used  to  support 
synchronization  based  on  atomic  operations,  incurring  potentially  non-trivial 
overheads.  Different  from  these  GPU-based  MapReduce  frameworks  [17,  7,  8, 
11],  Grex  1)  processes  a  MapReduce  job  in  a  single  pass,  2)  supports  parallel 
split,  3)  facilitates  load-balanced  data  partitioning  among  the  map  and  reduce 
tasks  of  a  job,  4)  supports  lazy  emit,  and  5)  efficiently  utilizes  the  GPU  memory 
hierarchy  for  performance  enhancement.  It  takes  advantage  of  not  only  shared 
memory  but  also  constant  memory  and  texture  cache,  while  requiring  no  atomic 
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operations  or  locks. 

Mithra  [18]  uses  Hadoop  as  a  backbone  to  cluster  GPU  nodes  for  MapRe¬ 
duce  applications.  By  building  a  GPU  cluster  on  top  of  Hadoop,  Mithra  takes 
advantage  of  powerful  features  of  Hadoop  such  as  fault  tolerance,  while  leverag¬ 
ing  a  large  number  of  GPU  threads.  As  a  result,  it  significantly  outperforms  a 
CPU-only  Hadoop  cluster.  However,  it  does  not  support  the  key  optimizations 
of  Grex  described  before.  GPMR  [12]  is  an  open  source  MapReduce  frame¬ 
work  designed  for  a  GPU  cluster.  It  relies  on  simple  clustering  of  GPU  nodes 
without  supporting  a  distributed  file  system  or  fault  tolerance.  Instead,  it  fo¬ 
cuses  on  providing  novel  features  for  partial  data  aggregation  to  decrease  the 
I/O  between  the  CPU  and  GPU  in  a  node  as  well  as  network  I/O  between  the 
map  and  reduce  tasks.  However,  GPMR  does  not  consider  to  support  the  main 
features  of  Grex.  Also,  the  current  GPMR  implementation  processes  already 
tokenized  data  only.  Thus,  variable  size  data  have  to  be  tokenized  using  the 
sequential  input  split  phase  in  Hadoop  or  by  any  other  means  before  being  pro¬ 
cessed  by  GPMR.  Mithra  [18]  and  GPMR  [12]  are  complementary  to  Grex  in 
the  sense  that  the  current  version  of  Grex  uses  a  single  GPU,  similar  to  Mars 
[7],  MapCG  [8],  and  [11].  Grex  could  be  combined  with  Hadoop  or  GPMR  to 
support  MapReduce  applications  in  a  GPU  cluster.  A  thorough  investigation  is 
reserved  for  future  work. 

Kwon  et  al.  [6]  investigate  several  causes  of  skews  in  Hadoop  applications. 
Grex  evenly  distributes  data  to  the  map  and  reduce  tasks.  This  approach  will 
alleviate  partitioning  skews;  however,  we  do  not  claim  Grex  completely  solves 
skew  problems.  For  example,  a  certain  <key,  value>  pair  can  be  computa¬ 
tionally  more  expensive  to  process  than  the  others  [6].  Efficiently  dealing  with 
potential  skews  in  a  GPU-based  MapReduce  framework  is  an  important  yet 
rarely  explored  research  problem.  Other  GPU-based  MapReduce  frameworks 
including  [7,  8,  11,  12,  18]  do  not  consider  this  problem. 

There  have  been  many  recent  applications  of  GPU  computing  to  differ¬ 
ent  problem  domains  such  as  Fast  Fourier  Transform  [19],  matrix  operations 
[20],  encryption  and  decryption  [21],  genetics  [22],  intrusion  detection  [23],  and 
databases  [24].  Using  GPUs,  these  research  efforts  have  achieved  significant 
performance  improvement  [25,  26].  These  results  call  for  a  general  GPU-based 
parallel  programming  framework  to  make  the  architectural  advantages  of  GPUs 
available  with  reduced  complexity  compared  to  manual  architectural  optimiza¬ 
tions. 


3.  Preliminaries 

In  this  section,  backgrounds  of  MapReduce  and  GPUs  needed  for  the  rest 
of  the  paper  are  given.  Through  the  paper,  word  count  is  used  as  a  running 
example  to  explain  key  concepts  of  MapReduce  and  Grex  especially  in  terms 
of  dealing  with  variable  size  data.  Although  word  count  is  conceptually  simple, 
it  requires  all  the  MapReduce  phases  and  deals  with  variable-size  textual  data 
challenging  to  handle  in  a  GPU. 
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3.1.  MapReduce 

A  MapReduce  framework  executes  an  application  as  a  set  of  sequential 
phases  such  as  split,  map,  group,  and  reduce.  Each  phase  is  executed  by  a 
set  of  tasks  that  run  in  parallel.  A  task  is  sequentially  processed  by  a  single 
processing  element  called  a  worker.  In  the  split  phase,  the  input  is  split  into 
smaller  subproblems  called  records  and  distributed  among  the  workers.  For  ex¬ 
ample,  a  text  file  such  as  a  web  document  can  be  split  at  new  lines  in  the  word 
count  application.  In  the  map  phase,  the  input  data  is  mapped  to  intermediate 
<key,  value>  pairs  by  a  user  defined  map()  function.  In  word  count,  a  map 
task  can  be  designed  to  process  a  user-specified  number  of  text  lines  to  emit 
<word,  1>  pairs  for  each  word  in  the  lines.  In  the  group  phase,  intermediate 
<key,  value>  pairs  are  sorted  and  grouped  based  on  their  keys.  Finally,  the 
intermediate  pairs  are  reduced  to  the  final  set  of  <key,  value>  pairs  by  a  user 
defined  reduceQ  function.  As  an  example,  for  10  <  ’the’,  1>  intermediate  <key, 
value>  pairs  produced  by  the  map  tasks,  one  new  pair,  <  ’the’,  10>  is  emitted 
as  the  final  output.  (Some  applications,  e.g.,  string  match,  do  not  need  grouping 
or  reduction.) 

Generally,  a  user  is  required  to  implement  the  nrapQ  and  reduceQ  func¬ 
tions  and  the  grouping  logic,  which  are  often  specific  to  applications.  Since  a 
nrap/reduce  task  or  grouping  is  executed  sequentially  in  MapReduce,  a  user 
is  only  required  to  specify  sequential  logic.  The  rest  of  the  problem,  such  as 
data  distribution  to  tasks,  collection  of  results,  and  error  recovery  for  parallel 
processing,  is  handled  by  the  MapReduce  runtime  system.  Thus,  developing  a 
parallel  application  via  MapReduce  does  not  impose  undue  complexity  of  par¬ 
allel  programming  on  users. 

3.2.  General  Purpose  Graphics  Processing  Units 

Graphics  processors  have  architectural  advantages  that  can  be  harnessed 
to  solve  computational  problems  in  domains  beyond  graphics  [26].  Further, 
the  introduction  of  general  purpose  programming  languages  such  as  CUDA  [1] , 
Stream  [27],  and  OpenCL  [2]  has  promoted  their  use  as  general  purpose  proces¬ 
sors  [26]. 

GPU  code  has  an  entry  function  called  a  kernel.  A  kernel  can  invoke  other 
(non-kernel)  device  functions.  Once  a  kernel  is  submitted  for  execution,  the 
host  has  to  wait  for  its  completion.  No  interaction  is  possible  with  a  running 
kernel.  A  GPU  has  a  number  of  SIMT  (Single  Instruction  Multiple  Threads) 
multiprocessors  as  shown  in  Figure  1.  Threads  are  temporarily  bound  to  these 
SIMT  elements.  Thus,  a  GPU  can  accommodate  thousands  or  even  more  live 
threads.  The  number  of  processing  elements  and  maximum  number  of  active 
threads  vary  among  GPUs  [26,  28].  Threads  are  grouped  into  blocks.  A  (thread) 
block  is  scheduled  dynamically  to  run  on  one  of  the  available  multiprocessors. 
All  the  threads  of  one  block  execute  the  same  code  because  the  underlying 
architecture  is  SIMT.  When  the  threads  in  a  block  branch  to  different  code 
segments,  branches  are  executed  sequentially.  This  phenomenon,  called  control 
divergence,  can  result  in  dramatic  performance  degradation.  Thus,  the  threads 
in  Grex  are  designed  to  execute  the  same  logic  at  a  time  to  avoid  divergence. 
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To  support  efficient  memory  access,  GPUs  employ  a  sophisticated  memory 
hierarchy  [29].  Specifically,  there  are  five  types  of  memory  as  shown  in  Figure  1: 
1)  a  set  of  general  propose  registers  per  thread;  2)  shared  memory  that  is  shared 
by  all  the  threads  in  a  block.  It  can  be  used  as  a  software  managed  cache  to  cache 
data  in  the  off-chip  global  memory;  3)  a  read-only  constant  memory  that  can 
be  used  as  software  managed  cache  for  the  global  memory  or  to  store  frequently 
accessed  constant  data  that  is  small  in  size;  4)  read-only  hardware  managed 
texture  cache  that  caches  global  memory;  and  5)  global  memory.  The  size  and 
access  delay  generally  increases  from  1)  to  5).  General  purpose  registers  are 
several  orders  of  magnitude  faster  than  global  memory  whose  access  latency  is 
hundreds  of  cycles.  Global  memory  is  the  largest  memory  in  the  GPU.  It  can  be 
accessed  by  any  thread  and  even  from  the  host  CPU.  However,  it  is  the  slowest 
memory. 

Shared  memory  is  almost  as  fast  as  the  registers.  Every  thread  of  one  thread 
block  shares  the  shared  memory;  therefore,  the  threads  in  one  thread  block  can 
communicate  easily  with  each  other.  However,  a  thread  in  one  thread  block 
cannot  access  the  shared  memory  of  a  different  thread  block.  Constant  and 
texture  memory  have  short  access  latency  upon  a  cache  hit.  Upon  a  miss,  they 
have  the  same  latency  as  global  memory.  Constant  and  texture  memory  can 
be  used  to  store  read-only  data.  Any  thread  can  access  any  data  stored  in 
constant  or  texture  memory.  Since  constant  memory  allocation  is  static,  the 
amount  of  constant  memory  to  be  used  must  be  known  at  compile  time.  As 
a  result,  constant  data  have  to  be  copied  to  the  fixed  address  before  it  can  be 
accessed.  Hence,  in  Grex,  constant  memory  is  used  to  keep  fixed-size  read-only 
data,  such  as  a  translation  table  used  to  convert  upper  case  characters  to  lower 
case  ones  in  word  count.  If  the  size  of  a  user  specified  data  structure  exceeds 
the  amount  of  constant  memory,  a  compile  time  error  is  issued  by  Grex.  Unlike 
shared  memory  and  constant  memory,  texture  cache  is  managed  by  hardware. 
It  does  not  require  static  allocation  and  provides  good  performance  for  random 
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Figure  1:  General  Purpose  GPU  Architecture.  (SP:  Streaming  Multi-Processor) 


Algorithm  1  Control  flow  of  Grex  for  processing  variable  size  data 

1.  Read  data  blocks  from  global  memory  and  cache  them  evenly  in  shared 
memory  across  the  thread  blocks.  Execute  the  parallel  split  phase.  (Skip 
this  step  if  the  data  size  is  fixed.) 

2.  Execute  the  map  phase.  Compute  a  unique  address  for  each  intermediate 
key  found  after  finishing  the  map  phase.  Using  the  unique  addresses,  write 
the  keys  to  global  memory  in  parallel  without  relying  on  locks  or  atomics. 

3.  Perform  boundary  merge  in  global  memory  to  combine  a  key  divided  be¬ 
tween  two  adjacent  thread  blocks,  if  any.  (Skip  this  step  if  the  data  size 
is  fixed.) 

4.  Group  the  intermediate  keys  via  parallel  sorting. 

5.  Read  the  intermediate  keys  from  global  memory  and  cache  them  evenly 
in  shared  memory  across  the  thread  blocks.  Execute  the  reduce  phase. 
Merge  the  results  of  the  reduce  phase  performed  by  two  adjacent  thread 
blocks  if  necessary.  (Skip  this  step  if  the  data  size  is  fixed.) 

6.  Perform  lazy  emit.  Write  the  results  to  global  memory.  (Lazy  emit  is 
optional.  If  lazy  emit  is  not  used,  the  keys  in  the  previous  steps  in  this 
pseudo  code  should  be  replaced  with  <key,  value>  pairs.) 


access  of  read-only  data. 

4.  System  Design 

This  section  describes  the  MapReduce  phases  and  buffer  management  scheme 
of  Grex. 

4-1.  MapReduce  Phases  in  Grex 

Grex  has  five  stages  of  execution:  1)  parallel  split,  2)  map,  3)  boundary 
merge,  4)  group,  and  5)  reduce.  A  pseudo  code  that  summarizes  these  stages  is 
given  in  Algorithm  1.  Note  that  the  parallel  split  phase  replaces  the  sequential 
input  split  phase  of  MapReduce  [3,  4],  Also,  lazy  emit  is  not  an  additional 
phase  but  it  postpones  emitting  values  until  the  end  of  the  entire  MapReduce 
computation.  Hence,  only  the  boundary  merge  is  added  to  the  MapReduce 
control  flow.  Moreover,  the  parallel  split  and  boundary  merge  phases  are  only 
needed  to  handle  variable  size  data.  If  the  size  of  data  is  fixed,  Grex  distributes 
the  same  number  of  data  items,  e.g.,  integers  or  floats,  to  each  task  for  load 
balancing.  Therefore,  no  parallel  split  or  boundary  merge  is  needed  for  fixed 
size  data.  A  detailed  description  of  the  MapReduce  phases  of  Grex  follows. 

1.  Parallel  split  phase:  In  the  parallel-split  step,  Grex  initially  assigns  one 
unit  of  data  to  each  thread.  For  example,  it  can  assign  a  single  character  in 
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the  input  documents  to  a  thread  to  compute  the  starting  position  and  length  of 
each  string  in  the  input  as  shown  in  Figure  2.  For  variable  size  data,  a  token  is 
a  key.  For  example,  a  word  or  URL  in  word  counting  or  web  document  analysis 
is  a  key.  Thus,  in  the  rest  of  this  paper,  we  use  keys  and  tokens  interchangeably. 

The  jth  thread  in  the  ith  thread  block,  T,;  j,  is  associated  with  a  unique  thread 
identifier:  TID(z,  j)  =  i  x  BS  +  j  where  BS  is  the  size  of  the  data  unit  handled 
by  a  thread  block,  which  is  equal  to  the  number  of  the  threads  in  a  block  in  the 
word  count  example  given  in  this  paper.  Ti  :j  writes  1  (or  0)  to  vec  [TID(z,  j)] 
in  Step  1  of  Figure  2,  if  the  assigned  byte,  i.e.,  input [TID(i,  j)],  is  a  token  (or 
non-token)  byte  using  the  user-specified  istokenO  function  such  as  ischarO 
in  word  count.  (Bit  sting  vec  is  a  temporary  data  structure  allocated  in  shared 
memory.) 
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3.  find  the  beginning  and  length  of  each  key 

km[0]  =  {0,  3,  T}  km[0]  =  {9+0,  5,  T} 

km [4]  =  {4,  3,  T}  km[6]  =  {9+6,  3,  T} 

km[8]  =  {8,  1,  T} 


Figure  2:  Kernel  1  -  Parallel  Split  (Steps  1  and  2)  and  Map  Phase  (Step  3) 
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Figure  3:  Key  record  data  structure 


Grex  applies  the  parallel  prefix  sum  algorithm  [30]  to  the  bit  string  to  com¬ 
pute  the  starting  position  and  length  of  each  key  in  parallel  as  shown  in  Step 
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2  of  Figure  2.  To  give  an  illustrative  example  of  parallel  split,  we  introduce 
a  data  structure  in  Figure  3,  which  shows  a  temporary  data  structure,  called 
keyRecord,  used  for  the  parallel  split  and  map  phases.  The  pointer  field  points 
to  the  beginning  of  a  variable  size  key,  e.g.,  a  word  or  URL.  The  length  field  is 
used  to  store  the  length  of  a  key.  Also,  the  valid  attribute  is  used  to  indicate  the 
beginning  of  a  key.  (This  data  structure  is  not  necessary  for  fixed  size  data.) 

An  instance  of  keyRecord  is  referred  to  km  in  Figure  2  and  stored  in  a  register 
by  a  thread.  As  shown  in  Step  2  of  Figure  2,  thread  T)  j  knows  that  it  has  the 
beginning  of  a  token,  if  vec[TID(?',j  —  1)]  =  0  and  vec  [TID(f,y )]  =  1  where 
i  >  0  and  j  >  1  and  sets  km[TID(?,j)]  .valid  =  T(rue)  in  Step  3  of  Figure  2.  It 
also  sets  the  km[TID(i,y)]  .pointer  to  the  beginning  of  the  key,  while  storing 
the  length  of  the  key  to  km[TID(i,y)]  .length. 

In  Grex,  thread  is  assumed  to  have  the  beginning  of  a  string,  if  j  =  0 
and  vec  [TID(f,  j)]  =  1.  In  fact,  this  is  not  always  true,  because  a  string  can 
be  divided  between  two  adjacent  blocks  as  shown  in  the  figure.  Such  a  case, 
if  any,  is  handled  by  the  boundary  merge  phase  that  will  be  discussed  shortly. 
The  boundary  merge  phase  is  needed,  since  two  different  GPU  thread  blocks 
cannot  directly  communicate  with  each  other  due  to  the  current  GPU  hardware 
design. 

The  time  complexity  of  the  parallel  prefix  sum,  which  is  the  most  expensive 
in  the  parallel  split  phase,  is  Q({jf)lgn)  where  n  is  the  length  of  the  bit  string 
indicating  the  input  size  in  terms  of  the  number  of  bytes  and  N  is  the  total  num¬ 
ber  of  the  GPU  threads  that  can  run  at  once.  Thus,  for  large  N,  parallel  split 
is  considerably  faster  than  a  sequential  algorithm  used  in  MapReduce  for  tok- 
enization,  which  is  0(n).  However,  existing  MapReduce  frameworks  including 
[3,  4,  7,  8,  9,  10,  11,  12]  do  not  support  parallel  split. 

Notably,  a  user  only  has  to  specify  an  istokenO  function,  such  as  the 
ischarO  function  in  Figure  2.  Parallel  split  is  supported  by  the  underlying 
Grex  runtime  system  and  hidden  from  the  user.  Since  a  user  has  to  write 
a  function  similar  to  istokenO  in  any  existing  MapReduce  framework  for  se¬ 
quential  input  split,  parallel  split  does  not  increase  the  complexity  of  developing 
a  MapReduce  application.  Notably,  all  threads  finish  the  parallel  split  phase 
simultaneously,  because  they  process  the  same  number  of  bytes  using  the  same 
algorithm.  Thus,  no  synchronization  between  them  is  required. 

2.  Map  phase:  A  user  specifies  a  User:  :map()  function,  similar  to  MapRe¬ 
duce  [3]  and  Hadoop  [4] .  Grex  further  processes  the  keys  in  shared  memory  tok- 
enized  during  the  parallel  split  step  via  the  map  tasks,  which  execute  the  map() 
function  in  parallel.  In  word  count,  the  parallel  split  and  map  phases  are  imple¬ 
mented  as  a  single  kernel,  since  the  map  function  is  only  required  to  write  the 
keys  found  in  the  parallel  split  phase  to  global  memory.  If  km  [TID  (i ,  j )  ]  .  valid 
==  T  in  Step  3  of  Figure  2,  thread  X^-  writes  km[TID(i,j)]  .pointer  into 
keys  [[TID(i, j)/2]]  in  global  memory.  Note  that  we  divide  the  thread  ID 
by  2,  since  a  block  cannot  have  more  keys  than  half  the  bytes  assigned  to  the 
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Figure  4:  Kernel  2  -  Boundary  merge  phase 


block.  A  string,  e.g.,  a  word  or  URL,  has  at  least  one  non-null  character  fol¬ 
lowed  by  a  null  character  that  indicates  the  end  of  the  string.  As  the  TID  of  a 
thread  is  unique,  the  threads  that  have  valid  pointers  to  the  keys  can  write  their 
pointers  to  the  global  memory  location  independently  from  each  other.  Thus, 
all  the  pointers  to  the  beginnings  of  the  keys  are  written  to  the  global  memory 
at  the  same  time  with  no  need  to  synchronize  the  threads. 

The  input  data  structure  in  Figure  2  is  part  of  a  Grex  buffer,  which  keeps  the 
input  data  and  output  of  the  map  phase  in  global  memory,  respectively.  Input  is 
evenly  divided  among  the  thread  blocks  and  cached  in  the  shared  memory  of  each 
block  by  Grex.  Also,  the  keys  data  structure  is  the  output  buffer  used  to  write 
the  output  of  the  map  phase.  A  user  can  specify  buffers  needed  to  store  input, 
intermediate,  or  output  data  in  global  memory.  Grex  caches  them  in  shared 
memory  or  texture  cache  in  a  transparent  manner.  In  addition,  temporary  data 
structures,  such  as  vec  and  km  in  Figure  2,  are  created  and  managed  by  Grex. 
In  the  current  version  of  Grex,  vec  and  km  are  stored  in  shared  memory  and 
registers,  respectively.  After  completing  the  execution  of  a  kernel  such  as  the 
one  for  parallel  split  and  map,  Grex  deallocates  the  temporary  data  structures 
in  shared  memory. (Buffer  management  and  caching  is  discussed  in  Section  4.2.) 

In  general,  if  the  size  of  data  is  bigger  than  the  shared  memory  size,  the  Grex 
steps  are  executed  iteratively.  Hence,  the  job  size  is  not  limited  by  the  shared 
memory  size.  In  addition,  data  upload  to  and  download  from  the  GPU  in  Grex 
is  overlapped  with  the  computation  in  the  GPU  to  reduce  the  I/O  overhead. 

Grex  supports  an  optional  feature  called  lazy-emit,  in  which  the  value  is  nei¬ 
ther  created  nor  held  in  memory  before  the  user-specified  phase.  For  lazy-emit, 
a  user  only  has  to  specify  in  which  phase  the  value  should  be  created  and  emitted 
to  memory  and  what  value  should  be  emitted  by  defining  a  User :  :  emit  (key) 
function.  For  example,  the  emit()  function  for  word  count  returns  1  for  each 
word.  Also,  a  user  can  optionally  implement  a  User:  :  combine  ()  function  to  do 
partial  reduction  of  intermediate  <key,  value>  pairs  produced  in  the  map  phase 
for  an  associative  and  commutative  reduceQ  function,  similar  to  MapReduce. 

3.  Boundary  merge  phase:  We  have  designed  a  simple  yet  effective  parallel 
method,  called  boundary  merge,  to  check  whether  any  key  is  split  between 
neighboring  thread  blocks  and  merge  it  to  a  single  key.  If  there  are  B  >  1 
blocks,  Grex  executes  a  kernel  using  B  —  1  threads  to  do  boundary  merge  in 
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1.  Cache  input  buffer  in  texture  cache 
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2.  Move  pointers  in  global  memory  for  sorting 
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Figure  5:  Kernel  3  -  Sorting  intermediate  keys 


global  memory,  since  there  are  B  —  1  block  boundaries.  When  the  map  phase 
is  completed,  the  result  is  written  to  global  memory  as  discussed  before.  The 
B—l  threads  for  boundary  merge  can  start  immediately  when  the  kernel  for  the 
parallel  split  and  map  phases  completes.  Also,  they  do  not  have  to  coordinate 
or  communicate  with  each  other,  since  they  do  not  share  any  data  or  other 
resources  with  each  other.  Thus,  they  work  independently  in  parallel. 

To  do  boundary  merge,  thread  i  (0  <  i  <  B  —  1)  observes  the  end  of  the  ith 
block  and  the  beginning  of  the  (i  +  l)th  block  to  check  whether  or  not  both  of 
them  are  valid  (i.e. ,  istokenQ  =  true)  as  shown  in  Figure  4.  If  this  condition 
is  satisfied,  thread  i  merges  them  to  a  single  key  and  eliminates  the  partial  key 
in  the  (*  +  l)t/l  block.  Also,  the  thread  decrements  the  number  of  the  keys  in 
the  (i  +  l)th  block.  At  the  end  of  the  boundary  merge  step,  the  pointers  to  the 
variable  size  tokens  are  stored  in  keys  buffer  in  global  memory  as  shown  at  the 
bottom  left  in  Figure  5. 

4.  Grouping  phase:  In  this  step,  the  intermediate  <key,  value>  pairs  are 
sorted,  similar  to  MapReduce  and  Hadoop.  In  this  paper,  parallel  bitonic  sorting 
is  used  for  grouping.  For  sorting,  a  user  only  has  to  specify  how  to  compare  keys 
through  the  User:  :cmp()  function.  For  example,  a  user  can  use  the  strcmpO 
function  in  word  count. 

To  decrease  the  memory  access  delay  for  parallel  sorting,  we  bind  the  input 
buffer  in  global  memory  to  the  on-chip  texture  cache  as  shown  in  Step  1  of 
Figure  5.  We  use  the  texture  cache  rather  than  shared  memory,  since  sorting 
generally  involves  data  accesses  across  thread  blocks.  For  grouping  in  Grex,  data 
comparisons  are  done  in  the  texture  cache,  which  is  read-only  and  managed 
by  the  hardware.  Actual  movements  of  the  pointers  to  variable  size  data,  if 
necessary  for  sorting,  are  performed  by  Grex  in  global  memory  as  shown  in 
Step  2  of  Figure  5.  As  a  result,  the  pointers  in  the  figure  are  sorted  in  non¬ 
ascending  lexicographic  order  of  the  keys,  i.e.,  words  in  word  count. 

For  grouping,  Mars  [7]  and  MapCG  [8]  create  an  index  space  to  keep  the 
<key,  value>  pairs  and  apply  hashing  to  find  <key,  value>  pairs  in  global 
memory.  Our  approach  is  more  efficient  than  these  methods,  since  it  reduces 
the  memory  consumption  as  well  as  the  delay  and  contention  for  global  memory 
access  by  leveraging  texture  cache  and  lazy  emit,  while  requiring  no  index  space. 
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Figure  6:  Kernel  4  -  Reduction 


5.  Reduce  phase:  Grex  provides  a  User:  : reduce ()  API  to  let  a  user  write  a 
reduce  function  specific  to  the  application  of  interest.  To  significantly  enhance 
performance,  Grex  uses  a  large  number  of  thread  blocks  to  reduce  <key,  value> 
pairs  in  a  parallel,  load  balanced  manner.  Grex  evenly  distributes  intermediate 
<key,  value>  pairs  to  each  reduce  task  to  avoid  load  imbalance  due  to  data 
skews  as  discussed  before.  (Only  one  thread  block  is  used  in  Figure  6  due  to 
the  small  data  size.) 


KcyRecord 
— •  pointer 

— •  count 

first  =  !cmp(prev,  this) 


Figure  7:  Key  record  data  structure  for  reduction 

In  the  reduce  phase,  Grex  uses  a  temporary  data  structure  in  Figure  7.  The 
pointer  field  points  to  the  beginning  of  a  variable  size  key.  The  count  field  is 
used  to  compute  the  frequency  of  a  key’s  occurrences  in  natural  language  or 
web  documents.  Also,  the  first  boolean  attribute  is  used  to  indicate  the  first 
occurrence  of  a  key.  An  instance  of  this  data  structure  is  called  kr.  It  is  stored 
in  a  register  by  a  thread.  (This  data  structure  is  unnecessary  for  fixed  size  data, 
similar  to  the  temporary  data  structures  used  for  the  parallel  split  and  map 
phases.) 

In  Step  1  of  Figure  6,  each  thread  is  assigned  a  word  and  it  compares  its 
word  to  the  word  of  the  next  thread  (if  any).  A  thread  sets  its  bit  in  the 
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temporary  bit  string,  vec,  newly  allocated  in  shared  memory,  if  the  two  strings 
are  different.  Using  the  bit  string,  the  threads  compute  the  exclusive  parallel 
prefix  sum  to  calculate  the  total  number  of  unique  keys.  In  Step  2,  thread  re¬ 
sets  kr[TID(i,  j)]  .valid  =  T(rue),  if  it  has  a  pointer  to  the  first  appearance 
of  a  key  or  vice  versa.  Also,  kr  [TID(i ,  j )]  .pointer  and  kr  [TID(i ,  j )]  .  length 
are  set  to  the  beginning  of  a  unique  key  and  the  number  of  the  occurrences,  i.e. , 
word  count,  of  the  key,  respectively.5  Finally,  each  thread  writes  a  pointer  to  a 
variable  size  key,  e.g.,  a  word,  to  the  keys  buffer  and  a  value,  e.g.,  word  count, 
to  the  values  buffer  in  global  memory  using  the  information  in  the  temporary 
kr  data  structure  in  Figure  6.  Thus,  the  final  output  is  <dog,  1>,  <inside,  1>, 
and  <the,  2>  in  this  example. 

Grex  also  applies  a  variant  of  the  boundary  merge  technique  in  the  reduce 
phase  to  produce  correct  final  <key,  value>  pairs  in  global  memory,  if  one  group 
of  <key,  value>  pairs  with  the  same  key  is  divided  over  more  than  one  thread 
blocks  due  to  load  balancing  among  the  reducers.  An  example  code  snippet  of 
boundary  merge  is  given  in  Section  5. 

4-2.  Buffer  Management 

In  Grex,  a  user  can  use  buffers  managed  by  the  Grex  runtime  system  for 
efficient  memory  access  in  the  GPU.  A  buffer  in  Grex  is  a  data  container  that  can 
partially  or  entirely  reside  in  global  memory,  texture  cache,  or  shared  memory 
depending  on  the  size  and  type  of  a  buffer  and  the  availability  of  shared  memory 
or  texture  cache  space.  Grex  provides  a  number  of  buffering  methods  and  user 
APIs  to  enhance  the  performance  by  leveraging  the  GPU  memory  hierarchy. 

To  use  a  buffer,  a  user  needs  to  specify  the  access  mode  and  size  of  a  buffer. 
Grex  provides  two  buffer  types:  1)  a  constant  buffer  and  2)  a  buffer.  A  constant 
buffer  is  stored  in  the  constant  memory.  Its  size  should  be  fixed  and  known 
at  compile  time  as  discussed  before.  We  describe  non-constant  buffers  in  the 
remainder  of  this  subsection.  A  (non-constant)  buffer  is  either  read-only  or  read- 
write.  If  the  user  specifies  a  buffer  as  read-only,  Grex  uses  the  texture  cache  to 
cache  the  buffer.  As  the  texture  cache  is  managed  by  the  hardware,  a  user  does 
not  need  to  specify  the  size.  Grex  simply  binds  a  read-only  buffer  to  the  texture 
cache.  Thus,  data  are  loaded  from  the  texture  cache  to  shared  memory  by  the 
hardware,  if  the  data  are  available  in  the  texture  cache.  On  the  other  hand,  if 
a  buffer  is  read-write,  Grex  directly  loads  data  from  global  memory  to  shared 
memory  instead  of  checking  the  texture  cache  first.  In  Grex,  a  user  can  also 
specify  a  specific  buffer  to  become  active  during  a  specific  MapReduce  phase 
via  the  User:  :init()  callback  function  invoked  and  executed  by  Grex  before 
processing  each  phase. 

To  process  /  bytes  of  input  data,  a  user  needs  to  specify  the  size  of  the  buffer- 
/block,  P,  and  the  number  of  bytes  processed  by  each  GPU  thread,  D(>  1). 


r>In  the  reduce  phase,  the  length  field  is  overloaded  to  indicate  the  word  count.  How- 
ever,  Grex  is  not  tied  to  this  overloading.  Depending  on  a  specific  application  of  interest,  a 
temporary  data  structure  different  from  keyRecord  can  be  used. 
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Algorithm  2  Grex  buffer  management 
The  following  steps  are  executed  in  the  CPU: 

1.  Invoke  User:  :init()  function.  In  this  function  the  user  enables/disables 
buffers  for  the  current  phase  and  modifies  their  properties  as  needed. 

2.  Before  starting  to  execute  the  first  phase  of  a  job,  upload  input  data  to 
global  memory. 

3.  Traverse  the  list  of  buffers  to  compute  the  total  size  of  the  shared  memory 
cache  needed  by  the  enabled  buffers.  Also,  bind  buffers  marked  as  read¬ 
only  to  the  texture  cache. 

4.  Launch  the  next  MapReduce  kernel  with  the  calculated  total  cache  size. 
The  following  steps  are  executed  in  the  GPU: 

5.  Initialize  the  shared  memory  cache  assigned  to  each  buffer  in  parallel. 

6.  Run  the  user-specified  phase  kernel. 

7.  Write  the  result  of  the  kernel  execution  temporarily  stored  in  in  shared 
memory,  if  any,  to  the  corresponding  read-write  buffer  in  global  mem¬ 
ory.  (To  do  this,  each  thread  that  has  data  to  write  to  global  memory 
independently  computes  a  unique  address  as  discussed  in  Section  4.1.) 


Thus,  to  process  I  bytes,  Grex  uses  B  =  blocks  and  T  =  threads /block. 
By  default,  the  size  of  the  shared  memory  used  to  cache  a  fraction  of  a  buffer  in 
each  block  is  equal  to  the  size  of  the  per-block  buffer  in  global  memory.  For  ex¬ 
ample,  if  the  size  of  a  per-block  buffer  is  256  bytes,  256  bytes  of  shared  memory 
in  each  thread  block  (if  available)  is  used  to  cache  the  buffer  by  default.  Grex 
also  provides  an  option  by  which  a  user  can  specify  a  different  amount  of  shared 
memory  to  be  used  to  cache  a  buffer  in  each  thread  block. 

In  Grex,  data  are  fetched  to  shared  memory  in  sequence  and  the  oldest  data  is 
replaced,  if  necessary,  to  accommodate  new  data,  since  a  considerable  fraction 
of  MapReduce  applications  linearly  scan  input  data  to  group  and  aggregate 
them  in  parallel.  The  overall  workflow  of  the  buffer  manager  in  Grex  is  given 
in  Algorithm  2.  Grex  buffer  manager  first  calculates  the  total  cache  size  needed 
by  each  thread  block  by  traversing  the  user  defined  buffers.  After  computing 
the  total  size  of  the  (shared  memory)  caches  needed  per  thread  block,  Grex 
launches  the  kernel  to  execute  a  MapReduce  phase  in  the  GPU. 

At  the  beginning  of  a  phase,  contiguous  regions  of  shared  memory  in  a  thread 
block  is  assigned  to  the  buffers  active  in  the  phase  and  initialized  as  specified  by 
the  user.  Grex  provides  parallel  cache  initialization  policies  a  user  can  choose 
for  a  cache:  1)  The  default  policy  reads  data  from  global  memory  into  the 
cache;  2)  An  alternative  policy  initializes  every  cache  element  to  a  constant  value 
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specified  by  the  user;  3)  The  no  initialization  policy  leaves  the  cache  (e.g.,  a  cache 
associated  with  an  output  buffer)  uninitialized;  and  4)  The  user  value  policy 
initializes  each  cache  element  to  the  return  value  of  user  defined  User :  :  cache  () 
function  that  can  be  used  to  convert  the  input  data.  The  selected  policy  is 
enforced  by  Step  5  of  Algorithm  2. 

After  initializing  caches,  the  kernel  for  the  corresponding  MapReduce  phase 
is  executed  in  Step  6  of  Algorithm  2.  After  finishing  the  execution  of  a  phase 
kernel,  Grex  writes  the  resulting  data  in  shared  memory  back  to  global  memory 
and  deallocates  shared  memory.  In  this  way,  we  aim  to  ensure  that  the  result  of 
executing  a  phase  is  globally  available  to  the  following  MapReduce  phases,  if  any. 
This  approach  allows  the  next  MapReduce  phase  to  fully  utilize  shared  memory 
too.  Since  most  MapReduce  applications  performs  group-by  and  aggregation 
functions  in  parallel,  the  amount  of  results  to  be  written  back  to  global  memory 
is  not  likely  to  be  bigger  than  the  original  input  data  size.  Thus,  the  overhead 
of  write-back  to  global  memory  in  Step  7  is  expected  to  be  tolerable.  Moreover, 
Grex  supports  lazy  emit  to  further  decrease  the  shared  memory  consumption 
and  the  amount  of  data  to  write  back  to  global  memory. 

In  summary,  Grex  is  designed  to  support  the  parallel  split  and  load-balanced 
map  and  reduce  phases,  while  efficiently  utilizing  shared  memory  and  texture 
cache.  In  addition,  Grex  supports  lazy  emit  and  it  is  free  of  locks  and  atomics 
different  from  the  other  GPU-based  MapReduce  frameworks  [7,  8,  11,  12,  18]. 

In  the  future,  GPUs  may  support  a  unified  cache  model.  However,  even  in 
a  single  cache  model,  it  is  important  to  carefully  manage  the  cache.  Our  buffer 
management  scheme  may  become  simpler  and  more  efficient  in  such  an  archi¬ 
tecture,  because  it  is  unnecessary  to  use  different  caches  based  on  read/write 
or  read-only  nature  of  a  buffer.  Also,  if  future  GPUs  allow  a  thread  to  directly 
read  data  cached  by  a  different  thread  block,  boundary  merge  can  be  performed 
not  in  the  slow  GPU  DRAM  but  in  the  cache.  Thus,  it  is  possible  for  the  per¬ 
formance  of  Grex  to  further  improve  in  such  an  environment.  In  addition,  the 
algorithmic  approaches  of  Grex  such  as  parallel  split,  even  distribution  of  data 
to  the  mappers  and  reducers,  and  late  emit  will  be  still  applicable  regardless  of 
the  cache  model. 

5.  Sample  Grex  Code 

This  section  illustrate  an  example  Grex  code  for  word  count  to  provide  details 
on  handling  variable  size  data.  The  basic  structure  of  word  count  is  applicable 
to  various  text  processing  applications  that  require  word  or  token  extraction.  A 
variable  data  application  example  is  preferred  since  a  fixed  size  data  processing 
implementation  is  relatively  trivial. 

In  Code  1,  Grex  calls  the  user  specified  User:  :main()  function  to  create 
buffers  and  do  system  initialization.  In  this  example,  three  (non- const  ant) 
buffers  and  one  constant  buffer  are  created.  A  user  can  use  the  newJbuffer 
<data_type>  (per_block_size)  function  call  to  create  a  buffer  with  a  certain 
data  type  and  size  per  thread  block.  Grex  creates  a  buffer  in  global  memory 
whose  size  is  equal  to  the  product  of  the  per-block  buffer  size  and  the  number 
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of  the  thread  blocks  as  discussed  in  Section  4.  Also,  it  caches  the  buffer  in 
shared  memory  of  each  thread  block  when  the  buffer  is  accessed  at  runtime.  To 
create  a  constant  buffer,  a  user  has  to  use  the  new_const_buff  er  <data_type> 
(total_size)  API  as  shown  in  the  code.  As  a  result  of  the  Grex  system  call, 
one  system-wide  constant  buffer  of  total_size  is  created  in  constant  memory. 

1  int  Grex: : User :: main  ()  { 

2  //  Create  a  per-block  buffer  that  contains  256  bytes  of  input  data. 

3  input  =  new_buf f er<char> (256) ; 

4  //  There  are  at  most  256/2  words  from  256  bytes  of  data  due  to  the 

5  //  null  char  at  the  end  of  a  string. 

6  keys  =  new_buf  f  er<char*>  (128) ; 

7  //  Max  no.  of  values  is  equal  to  max  no.  of  words. 

8  values  =  new_buf f er<int> (128) ; 

9  //  Create  a  constant  buffer  to  hold  the  upper-case  to  lower-case 

10  //  translation  table 

n  tr_map  =  new_cbuf f er<char>  (tr_table ,  256); 

12  //  Store  boundary  key  sizes  for  merging.  There  can  be  at  most  2 

13  //  boundary  words  in  a  thread  block. 

14  keysize  =  new_buf  f  er<char>(2) ; 

15  ... 

16  > 

Code  Snippet  1:  Buffer  setup 

When  Grex  completes  executing  the  main  function,  it  reads  the  input  data 
from  disk  as  shown  in  Code  2  before  it  starts  running  a  MapReduce  job  in  the 
GPU.  Data  read  from  the  input  file  are  downloaded  to  global  memory  over  the 
PCIe  bus  that  connects  the  host  and  GPU.  The  data  is  read  and  the  job  is 
iteratively  executed  until  the  read_input()  function  returns  the  End  of  File 
(EOF).  In  this  way,  the  GPU  memory  can  be  allocated  and  the  job  can  be 
executed  in  an  incremental  manner.  Grex  also  overlaps  the  I/O  across  the  PCIe 
bus  with  the  computation  in  the  GPU,  if  any,  to  hide  the  I/O  overhead. 

1  size_t  Grex: :User : :data( 

2  Buffer  *b,  char  *d,  off_t  offset,  size_t  size){ 

3  if (buf f er->id() ==input->id() )  { 

4  input_f ile . seek(of f set) ; 

5  return  input_f ile .read (d,  size); 

6  > 

7  return  0; 

8  > 

Code  Snippet  2:  Reading  Input  Data 

Code  3  shows  an  example  user  initialization  routine  for  the  word  count  ap¬ 
plication.  Grex  invokes  the  User:  :init()  function  before  starting  to  execute 
each  phase  kernel.  In  the  function,  the  user  needs  to  declare  buffers  that  will  be 
used  in  a  specific  phase  and  set  the  type  of  each  buffer  as  either  a  read-write  or 
read-only  buffer.  A  buffer  becomes  usable  once  it  is  enabled  and  remains  valid 
until  it  is  explicitly  disabled  by  the  user. 

When  a  buffer  is  enabled,  it  is  set  as  a  read- write  buffer  by  default.  A 
user  needs  to  explicitly  set  a  buffer  as  read  only  via  the  read_only()  member 
function  as  shown  in  Code  3.  A  read-only  buffer  is  cached  in  texture  memory 
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as  discussed  before. 

Notably,  there  is  no  case  statement  for  parallel  split  or  boundary  merge  in 
Code  3,  because  the  execution  of  the  parallel-split  and  boundary  merge  phases 
plus  the  buffer  management  in  those  phases  are  handled  by  Grex.  A  user  only 
has  to  define  the  boundary  merge  function  (e.g.,  Code  5).  Also,  in  Code  3,  the 
values  buffer  declared  in  Code  1  is  enabled  in  the  reduce  phase,  because  lazy 
emit  is  applied  in  our  word  count  example. 

1  int  Grex: : User :: init(Job_t  & j ,  Task_t&  t)  { 

2  switch  (  t. phase  )  { 

3  case  MAP : 

4  //  Do  not  initialize  the  cache  for  keys 

5  //  Initialize  the  cache  for  keysize  as  zeros 

6  buf fers . enable (input ,  keys,  keysize); 

7  keys . cacheFunc (CACHE_NO_INIT) ; 

8  keysize . cacheFunc (CACHE_VALUE,  0); 

9  break ; 

10  case  GROUP : 

n  //  Specify  keys  as  the  output  buffer 

12  //  Declare  input  as  read-only  (texture  cache) 

13  set_targets(keys) ; 

14  input .  writeEnable  (false) ; 

15  break; 

16  case  REDUCE : 

17  //  Specify  keys  as  the  output  buffer 

is  //  Enable  values 

19  set_targets(keys) ; 

20  buffers . enable (values) ; 

21  break; 

22  } 

23  return  1; 

24  } 

Code  Snippet  3:  User  initialization 


1  _ device _  void  Grex :: User :: map ()  { 

2  //  Initialize  active  buffers 

3  IBUFFERS ( input ,  keys,  keysize); 

4  input [worker Id]  =  tr (input [worker Id] ) ; 

5  //  Parallel  split  of  the  input  buffer  into  words 

6  KeyRecord  kr  =  input . split (isWordChar () ) ; 

7  //  Emit  words  found  in  the  input 

8  keys . emit (kr . key ,  kr . valid) ; 

9  //  Emit  key  size  only  for  boundary  keys 

10  keysize . emit (kr . size ,  input . is_boundary (kr) ) ; 

11  } 

Code  Snippet  4:  User  map  phase  function 

In  Code  4,  the  map  phase  is  implemented  as  a  device  function  executed  in 
the  GPU.  (Code  4  —  Code  8  are  all  device  functions.)  In  Code  4,  the  tr() 
function  is  executed  by  the  threads  to  convert  the  input  data  to  lower-case  by 
referring  to  the  translation  table  stored  in  the  constant  memory,  while  removing 
non-word  characters  in  parallel.  In  Code  4,  the  parallel  split  is  invoked  via  the 
input .  split  ()  function  provided  by  Grex.  The  pointers  to  the  keys  found  as 
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a  result  of  executing  the  parallel  split  step  is  written  to  global  memory  by  the 
keys .  emit  ()  function.  In  Code  4,  only  the  keys  are  generated  due  to  lazy  emit. 

1  _ device _  void  Grex: :User : :mmerge()  { 

2  IBUFFERS (keys ,  keysize); 

3  //  Read  two  keys  on  the  boundary  of  two  adjacent  thread  blocks 

4  char  *kl  =  keys . last_of _blk (worker Id) ,  *k2  =  keys . f irst_of _blk(workerId+l) ; 

5  int  sizel  =  keysize . last_of _blk(workerId) ; 

6 

7  //  If  one  word  is  spread  across  two  adjacent  blocks,  merge  parts 

8  //  into  one  key.  This  is  the  case  if  kl  ends  where  k2  begins 

9  if  (  kl  +  sizel  ==  k2  )  //  Just  delete  the  2nd  key  to  merge 

10  keys . del_f irst_key (workerld+l) ; 

n  > 

Code  Snippet  5:  User  function  for  boundary  merge  at  the  end  of  the  map  phase 

The  boundary  merge  function  is  shown  in  Code  5.  The  function  is  executed 
in  parallel  using  B  —  1  threads  where  B  is  the  total  number  of  the  thread  blocks. 
Note  that  deleting  the  first  key  of  the  second  block  is  enough  for  a  thread  to 
merge  the  pieces  of  a  key  divided  between  the  adjacent  thread  blocks,  because 
the  last  key  in  the  first  block  already  points  to  the  beginning  of  the  merged 
word. 

1  _ device _  void  Grex: : User :: reduce (){ 

2  IBUFFERS (keys ,  values); 

3  //  Parallel  split  of  keys  into  two  groups  at  points  where  two 

4  //  consecutive  keys  are  different 

5  ReduceRecord  rr  =  keys . split (CompareKeys ()) ; 

6  keys . emit (rr . key ,  rr . valid) ; 

7  values . emit (rr . size ,  rr. valid); 

8  > 

Code  Snippet  6:  User  reduce  function 

The  reduce  function  in  Code  6  splits  the  keys  into  groups  and  combines  each 
group  with  the  same  key  into  a  single  <key,  value>  pair.  A  thread  executes 
the  user  defined  compareKeys  ()  function  to  compare  the  key  assigned  to  itself 
with  the  key  assigned  to  the  next  thread.  It  returns  0  if  the  two  keys  are  equal. 
Otherwise,  it  returns  1.  Grex  creates  a  bit  string  in  shared  memory,  in  which  1 
indicates  the  beginning  of  a  new  group  and  contiguous  0’s  following  1  indicates 
a  group,  similar  to  the  bit  string  in  Figure  6.  By  applying  the  exclusive  parallel 
prefix  sum,  Grex  computes  the  size  of  each  key  group  as  discussed  in  Section  4. 
By  doing  this,  the  reduce  function  in  Code  6  groups  the  keys  and  aggregates  each 
group  with  the  same  key  into  a  single  <key,  value>  pair,  e.g.,  <word,  count> 
in  word  count.  At  the  end  of  the  reduce  phase,  actual  values  are  generated  and 
written  to  global  memory. 

Code  7  merges  a  key  group  divided  between  two  adjacent  thread  blocks 
due  to  load  balancing.  If  there  are  B  thread  blocks,  B  —  1  threads  are  used 
for  boundary  merge  at  the  end  of  the  reduce  phase.  Using  the  user  defined 
compare  function  (e.g.,  strcmp()  in  word  count),  a  thread  compares  the  last 
key  of  a  thread  block  and  the  first  key  of  the  next  block.  If  they  are  same, 
the  frequency  of  the  word  is  changed  to  the  sum  of  the  sizes  of  the  first  and 
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second  key  groups.  This  parallel  process  is  executed  iteratively  until  no  group 
size  increases  any  further. 

1  _ device _  void  Grex: :User : :rmerge(){ 

2  IBUFFERS ( input , values ) ; 

3  char  *kl=last_of _blk (worker Id) ,  *k2=f irst_of _blk(workerId+l) ; 

4  int  *vl=plast_of _blk (worker Id) ,  *v2=pf irst_of _blk(workerId+l) ; 

5  if  (compare (kl,  k2)){ 

6  *vl+=*v2; 

7  return  1; 

8  > 

9  else  return  0; 

io  > 

Code  Snippet  7:  User  function  for  boundary  merge  after  the  reduce  phase 

1  void  Grex: : User : :end(){ 

2  compact (keys ,  values); 

3  char  *b_data  =  input  .downloadO  ; 

4  char  **b_keys  =  keys  .downloadO  ; 

5  int  *b_values  =  values . downloadO  ; 

6  display (b_keys ,  b_values) ; 

7  > 

Code  Snippet  8:  User  end  function 

Finally,  Grex  invokes  the  User:  :end()  function  when  all  phases  are  exe¬ 
cuted.  The  example  User:  :end()  function  given  in  Code  8  downloads  the  keys 
and  values  from  the  global  memory  in  the  GPU  into  the  main  memory  in  the 
host. 


6.  Performance  Evaluation 

In  this  section,  the  experimental  settings  for  performance  comparisons  among 
Mars  [7],  MapCG  [8],  and  Grex  are  described.  Also,  the  performance  evaluation 
results  are  discussed. 

We  pick  Mars  and  MapCG  for  performance  evaluation,  because  they  rep¬ 
resent  the  state-of-the-art  in  terms  of  GPU-based  MapReduce.  In  addition, 
they  are  open  source.  Ji  and  Ma  [11]  have  also  developed  a  novel  GPU-based 
MapReduce  framework.  It  uses  atomic  operations  to  avoid  two  pass  execution 
of  a  MapReduce  job,  similar  to  MapCG.  However,  [11]  is  not  publicly  available; 
therefore,  we  cannot  compare  the  performance  of  Grex  to  that  of  [11]. 

We  have  tested  the  performance  of  Grex  using  different  generations  of  GPUs 
for  Mars  and  MapCG.  The  experiments  for  performance  comparisons  between 
Grex  and  Mars  were  run  on  the  NVIDIA  GeForce  8800  GTX  GPU,  which  was 
used  for  performance  evaluation  by  the  developers  of  Mars  [7] .  The  benchmarks 
comparing  Grex  against  MapCG  were  executed  on  the  NVIDIA  GeForce  GTX 
460  GPU  that  supports  atomic  operations  needed  by  MapCG.  Using  different 
platforms  for  Mars  and  MapCG  provides  fair  comparisons,  since  Mars’  design 
was  influenced  by  relatively  restrictive  programming  support  of  older  generation 
GPUs,  in  which  atomic  operations  are  not  available  and  coalescing  memory 
accesses  has  a  significant  impact  on  the  performance.  The  host  machine  in  both 
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cases  has  the  AMD  Athlon  II  X4  630  CPU,  4GB  RAM,  and  500GB  hard  disk. 
Linux  2.6.32.21  is  installed  on  the  host. 

6.1.  Benchmarks 

In  this  section,  the  benchmarks  used  for  performance  evaluation  are  de¬ 
scribed.  Table  2  summarizes  the  characteristics  of  the  micro-benchmarks.  They 
are  popular  MapReduce  applications  and  used  for  performance  evaluation  in 
the  Mars  [7]  and  MapCG  work  [8].  By  using  the  same  benchmarks,  we  intend 
to  provide  the  fairness  of  our  performance  comparisons  with  Mars  and  MapCG. 
Similar  workloads  are  used  for  performance  evaluation  in  [11]  and  [12]  too. 

1.  Similarity  Score  (SS):  Similarity  score  is  a  matrix  application.  The  input 
is  a  list  of  document  feature  vectors  v j  that  represent  the  characteristics  of  given 
documents.  The  output  of  this  application  is  pairwise  similarity  scores  of  the 
input  documents  ,  P  G  , .  Since  the  size  of  an  input  data  is  fixed  in  SS,  no  parallel 
split  phase  is  needed.  Similarly,  the  size  of  groups  in  reduce  phase  is  known  and 
fixed.  Thus,  the  input  split  and  boundary  merge  phases  are  not  needed  as 
shown  in  Table  2.  To  compute  the  similarity  score,  we  implement  tiled  matrix 
operations,  in  which  tiles  (i.e.,  sub-matrices)  are  manipulated  independently  in 
the  map  phase  and  the  intermediate  results  are  aggregated  in  the  reduce  phase. 
The  divisors  and  the  dividends  are  calculated  in  the  map  phase.  In  the  reduce 
phase,  we  calculate  the  final  values  by  doing  the  divisions. 

2.  Matrix  Multiplication  (MM):  Since  the  input  to  MM  is  fixed  sized,  it 
does  not  require  the  input  split  and  boundary  merge  phases  as  summarized  in 
Table  2.  We  implement  tiled  matrix  multiplication  in  the  map  phase  and  add 
the  partial  sums  of  the  tiles  in  the  reduce  stage. 

3.  String  Match  (SM):  Given  an  input  string,  the  string  match  application 
finds  the  offset  of  the  first  character  for  each  occurrence  of  the  string  in  the 
input  file.  Our  implementation  in  Grex  uses  a  constant  buffer  to  store  the 
search  string.  Neither  sequential  nor  parallel  input  split  is  needed,  because  a 
match  can  be  found  without  tokenizing  the  input  data.  In  the  map  phase, 
parallel  string  matching  is  performed.  In  Grex,  boundary  merge  is  performed 
to  find  any  matching  string  divided  between  the  adjacent  thread  blocks.  This 
application  does  not  require  sorting  or  reduction,  since  the  output  buffer  already 
contains  individual  pointers  to  the  matched  strings. 

4.  Word  Count  (WC):  This  benchmark  counts  the  frequency  of  each  word’s 
occurrences  in  the  input  file.  The  final  output  is  <word,  frequency>  pairs  sorted 
in  non-ascending  order  of  frequencies.  This  benchmark  is  required  to  run  every 
phase  of  MapReduce  as  shown  in  Table  2.  Grex  does  parallel  split  and  boundary 
merge.  In  addition,  Grex  uses  the  lazy  emit  option  to  further  decrease  the 
memory  consumption  and  access  delay  by  storing  only  the  keys  until  the  end  of 
the  reduce  phase. 
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Name 

Input 

Split 

Map 

B.  Merge 

Reduce 

L.  Emit 

Data  Size 

SS 

fixed 

no 

yes 

no 

yes 

no 

512,  1024,  2048  documents 

MM 

fixed 

no 

yes 

no 

yes 

no 

512,  1024,  2048  elements 

SM 

variable 

no 

yes 

yes 

no 

no 

32,  64,  128  MB 

WC 

variable 

yes 

yes 

yes 

yes 

yes 

32,  64,  128  MB 

II 

variable 

yes 

yes 

yes 

no 

no 

32,  64,  128  MB 

PC 

variable 

yes 

yes 

yes 

yes 

no 

32,  64,  96  MB 

Table  2:  Properties  of  the  benchmarks 


5.  Inverted  Index  (II):  The  inverted  index  application  reads  a  group  of  web 
documents  and  outputs  every  encountered  link  and  the  names  of  the  web  doc¬ 
uments  that  include  the  link.  Since  the  URL  size  is  variable,  this  application 
requires  the  parallel  split,  map,  boundary  merge,  and  group  phases  for  Grex. 
However,  it  needs  no  reduce  phase. 

6.  Page  View  Count  (PC):  In  the  page  view  count  benchmark,  the  input 
is  a  group  of  web  logs  in  the  form  of  (URL,  IP  address,  cookie)  tuples  and  the 
output  is  the  number  of  distinct  visits  to  each  URL  sorted  in  non-ascending 
order.  The  input  to  this  benchmark  is  variable  sized;  therefore,  this  application 
requires  the  parallel  split,  map,  boundary  merge,  and  group  phases.  In  addition, 
it  needs  the  reduce  phase  as  summarized  in  Table  2. 

We  have  tested  three  different  sets  of  inputs  for  each  application:  small, 
medium,  and  large,  similar  to  Mars  and  MapCG.  For  Word  Count  and  String 
Match  applications,  the  input  data  is  created  by  merging  various  e-books.  The 
input  for  Page  View  Count  application  is  created  by  randomly  combining  differ¬ 
ent  web  logs.  The  input  for  Inverted  Index  is  generated  using  a  simple  crawler. 
The  input  for  Matrix  Multiplication  and  Similarity  Score  are  randomly  gener¬ 
ated  using  the  input  generator  provided  by  Mars  [31]. 

6.2.  Experimental  Results 

The  performance  evaluation  results  in  terms  of  the  speedup  achieved  by  Grex 
over  Mars  and  MapCG  are  given  in  Figures  8  and  9,  respectively.  We  also  show 
the  throughput  achieved  by  Grex  for  the  large  data  sets  in  Table  2.  A  more 
detailed  discussion  of  the  performance  results  follows. 

1.  Similarity  Score  (SS):  For  this  benchmark,  Grex  achieves  up  to  approx¬ 
imately  2.4x  speedup  over  Mars  for  the  largest  data  set  as  shown  in  Figure  8. 
It  also  provides  roughly  2.5x  speedup  over  MapCG  for  the  medium  data  set 
as  shown  in  Figure  9.  The  performance  loss  we  observe  for  the  large  data  set 
in  Figure  9  is  due  to  the  increased  input  size  that  requires  the  reduce  phase 
to  combine  more  partial  results.  The  observed  speedups  are  due  to  better  use 
of  memory  hierarchy.  The  throughput  for  the  largest  dataset  with  Grex  is  56 
GB/s  from  the  task  release  to  completion. 
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Figure  8:  Speedup  of  Grex  over  Mars  (Delay  of  Mars:  86ms  -  10901ms) 
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Figure  9:  Speedup  of  Grex  over  MapCG  (Delay  of  MapCG:  53ms  -  2389ms) 


2.  Matrix  Multiplication  (MM):  For  the  largest  data  set,  which  requires  to 
multiply  two  2048  x  2048  matrices,  Grex  is  3.8x  faster  than  Mars.  Grex  shows 
2x  speedup  over  MapCG  for  the  small  set;  however,  the  speedup  decreases 
down  to  1.5x  for  the  large  data  set,  similar  to  SS.  Although  SS  and  MM  do  not 
need  parallel  split  or  lazy  emit,  Grex  substantially  enhances  the  performance 
by  taking  advantage  of  the  GPU  memory  hierarchy  and  tiling  technique.  For 
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the  fairness  of  comparisons,  tiling  is  used  for  the  baselines  as  well  as  Grex.  The 
throughput  of  Grex  is  92  GB/s  for  the  large  dataset.  Both  SS  and  MM  do  matrix 
multiplication.  However,  SS  requires  vector  multiplications  and  divisions,  which 
are  more  computationally  demanding  than  multiplying  numbers  and  adding  the 
partial  results  in  MM.  Hence,  for  SS  and  MM,  Grex  shows  different  performance 
in  terms  of  the  throughput  and  speedup  over  Mars  and  MapCG. 

3.  String  Match  (SM):  Grex  achieves  maximum  2.2x  speedup  over  Mars  for 
the  medium-size  input  and  3.7x  speedup  over  MapCG  for  the  largest  data  set. 
Grex  achieves  the  performance  improvement  due  to  parallel  split  and  efficient 
buffering  using  the  shared  memory  and  texture  cache.  The  throughput  of  SM  for 
the  large  dataset  is  36  GB/s.  By  comparing  the  throughput  achieved  by  Grex 
for  SS,  MM,  and  SM,  we  observe  that  processing  variable  size  data  is  more 
challenging  than  dealing  with  fixed  size  data  is.  As  a  result,  Grex  supports 
significantly  lower  throughput  than  processing  fixed  size  data. 

4.  Word  Count  (WC):  Grex  achieves  up  to  12. 4x  speedup  over  Mars.  Also, 
it  achieves  up  to  4x  speedup  over  MapCG.  Among  the  tested  benchmark  ap¬ 
plications,  Grex  achieves  the  highest  performance  improvement  for  word  count. 
This  is  because  word  count  requires  all  the  MapReduce  phases  and  the  amount 
of  data  to  process  can  be  decreased  only  after  the  reduce  phase  begins.  Since 
we  use  English  e-books  for  WC,  data  skews  impair  the  performance  of  Mars 
and  MapCG.  In  contrast,  Grex  evenly  divides  intermediate  <key,  value>  pairs 
to  the  reduce  tasks  to  avoid  performance  penalties  due  to  skewed  data  distri¬ 
butions.  Further,  Grex  applies  the  parallel  split  method,  while  applying  lazy 
emit  as  well  as  caching  to  decrease  the  frequency  of  global  memory  access.  The 
throughput  for  WC  in  Grex  is  14  GB/s. 

5.  Inverted  Index  (II):  This  benchmark  requires  the  map  and  grouping 
phases  of  MapReduce  to  invert  links  and  group  them  together;  however,  it  needs 
no  reduce  phase.  Due  to  parallel  split,  even  data  partitioning  among  map/re- 
duce  tasks,  and  efficient  buffering,  Grex  is  up  to  1.7x  and  2.3x  faster  than  Mars 
and  MapCG,  respectively.  The  measured  throughput  of  II  with  Grex  for  the 
large  dataset  is  11  GB/s.  II  deals  with  variable  sized  data.  In  addition,  it  has 
no  means  to  reduce  the  amount  of  data  during  the  computation,  because  it  only 
inverts  links  without  aggregating  them  at  all.  Thus,  Grex  achieves  the  lower 
throughput  than  it  did  for  WC. 

6.  Page  View  Count  (PC):  For  the  large  data  set,  Grex  is  2x  faster  than 
Mars  and  1.3x  faster  than  MapCG.  This  benchmark  needs  all  the  MapReduce 
phases,  similar  to  WC.  However,  Grex  achieves  relatively  small  performance 
gains  over  Mars  and  MapCG,  because  dealing  with  variable-size  URLs  requires 
more  complex  data  processing  than  handling  natural  language  words  in  WC. 
For  the  large  dataset,  the  throughput  of  Grex  is  8  GB/s.  The  throughput  is 
lower  than  that  of  II,  because  PC  is  more  compute-intensive  than  II  is.  In  PC, 
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intermediate  <key,  value>  pairs  have  to  be  sorted  and  aggregated  in  the  reduce 
phase. 

Finally,  Figure  10  shows  the  time  breakdown  of  the  Grex  phases  in  terms  of 
percentage.  I/O  for  moving  input  and  output  data  between  the  CPU  and  GPU 
takes  approximately  10%  of  the  total  job  execution  delay  on  average  with  the 
only  exception  of  SM.  In  SM,  however,  the  percentage  of  time  spent  for  I/O  is 
exaggerated  due  to  the  artifact  of  a  computationally  simple  map  phase.  Also, 
SM  needs  neither  a  grouping  phase  nor  a  reduce  phase.  Although  both  MM 
and  SS  are  matrix  benchmarks,  the  relatively  complex  reduce  phase  of  SS  takes 
about  50%  of  the  job  execution  time.  Notably,  textual  data  applications  that 
need  the  grouping  phase,  i.e. ,  word  count,  inverted  index,  and  page  view  count, 
spend  approximately  30%  —  60%  of  the  job  execution  time  for  sorting.  However, 
parallel  split  and  boundary  merge  phases  together  take  less  then  10%  of  total 
execution  time.  From  these  results,  we  claim  that  the  overhead  of  parallel  split 
and  boundary  merge  is  acceptable  considering  the  fact  that  (either  sequential 
or  parallel)  input  split  is  required  to  process  variable  size  data.  Moreover,  we 
observe  that  further  performance  gains  can  be  yielded  by  accelerating  the  sort 
phase,  while  decreasing  the  I/O  overhead.  A  thorough  investigation  is  reserved 
for  future  work. 

Overall,  Grex  has  achieved  substantial  performance  improvements  compared 
to  Mars  and  MapCG,  because  it  avoids  executing  the  map  and  reduce  phases 
twice  and  effectively  utilizes  the  GPU  memory  hierarchy,  while  supporting  par¬ 
allel  split  and  even  load  distribution  among  map/reduce  tasks.  Also,  Grex 
outperforms  MapCG  by  avoiding  serialized  memory  management  and  by  sup¬ 
porting  parallel  split,  even  data  partitioning  among  map/reduce  workers,  lazy 
emit,  and  efficient  memory  management. 
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7.  Conclusion  and  Future  Work 


In  this  paper,  we  present  a  new  GPU-based  MapReduce  framework  called 
Grex.  We  exploit  the  GPU  hardware  parallelism  and  memory  hierarchy  to  sig¬ 
nificantly  enhance  the  performance  especially  in  terms  of  the  delay.  We  strive 
to  avoid  creating  possible  performance  bottlenecks  or  load  imbalance  due  to 
inefficient  data  distribution  or  data  skews.  At  the  same  time,  we  avoid  using 
atomics  and  locks,  which  incur  potentially  high  contention  for  memory  access 
and  complexities  for  conflict  resolution.  To  achieve  these  goals,  we  have  de¬ 
veloped  a  number  of  new  methods  for  GPU-based  MapReduce  summarized  as 
follows: 

•  Grex  supports  parallel  split  that  replaces  the  sequential  split  step  provided 
by  most  existing  MapReduce  frameworks; 

•  Grex  partitions  data  to  map  and  reduce  tasks  in  a  parallel,  load-balanced 
fashion; 

•  Grex  supports  efficient  memory  access  by  leveraging  shared  memory,  tex¬ 
ture  cache,  and  constant  memory;  and 

•  It  supports  lazy  emit  to  further  decrease  the  frequency  of  global  memory 
access  and  the  resulting  delay. 

We  have  actually  implemented  and  experimentally  evaluated  Grex.  The 
results  of  the  performance  evaluation  indicate  up  to  12. 4x  and  4. lx  speedup 
over  Mars  [7]  and  MapCG  [8]  that  represent  the  current  state  of  the  art  in 
terms  of  GPU-based  MapReduce  computing. 

Given  the  promising  initial  results,  we  will  incorporate  Grex  with  Hadoop 
to  efficiently  process  large  amounts  of  data  with  minimal  delays.  We  will  also 
investigate  the  portability  of  Grex  to  AMD  GPUs. 
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PI-s:  Md  Enamul  Karim,  Vir  Phoha,  Sumeet  Dua 


The  AFOSR  support  Louisiana  Tech  University  received  under  the  grant  FA  9550-09-1-0165  is  utilized 
in  developing  two  complementary  experimentation  frameworks:  (i)  a  dual  channel  non-interactive 
protocol  based  lightweight  framework  for  identification,  storage  and  analysis  of  possibly  malicious 
traffic  and  (ii)  a  generic  distributed  social  network  framework,  which  could  be  extended  to  other  multi¬ 
sensing  based  network  analytics.  They  are  briefly  described  below: 

1.  A  framework  for  identification,  storage  and  analysis  of  possibly  malicious(/Botnet)  traffic 

Investigators:  Md  Karim  and  Vir  Phoha 

A  framework  is  designed  based  on  a  non-interactive  dual  channel  lightweight  protocol  {Figure  1)  we 
developed  that  can  be  used  for  the  detection  of  botnet  traffic.  The  realization  of  this  protocol  is 
achieved  using  Raspberry-PI  attachments  to  keyboards  {Figure  2)  and  thereby  facilitating  the  dual 
channels.  We  collected  network  traffic  data  from  systems  using  this  scheme  over  three  months  and 
conducted  some  preliminary  investigations  on  the  collected  data.  The  detail  is  available  in  reference  [1] 
which  will  be  presented  at  the  20th  Annual  Network  &  Distributed  System  Security  Symposium 
(NDSS  2013). 

The  protocol  we  have  developed  has  wide  applicably  in  lightweight  verification  and  analyses  of  the 
data  we  have  been  collecting  can  provide  some  very  unique  understanding  of  the  malicious  traffic  and 
possible  drifts  in  their  patterns.  AFRL  can  benefit  from  our  protocol  and  data  for 
authentication/verification  and  botnet  related  research. 


Alice 

Bob 

Generate  (r)  =  ( r{,r2 )  such  that: 

-  r2  can  verify  r\ 

-  r2  is  difficult  to  compute  from  r, 

-  Bob  has  a  way  to  verify  if  r2  is 
generated  by  Alice 
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Receive  (r'2) 
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Endif 

Figure  1.  Proposed  non-interaetive  dual  channel  protocol 


Figure  2.  Outline  of  the  prototype  for  malicious  traffic  detection 

2.  A  generic  distributed  social  network  framework 

Investigator:  Sumeet  Dua 

A  secure  cyber  space  heavily  relies  on  Situational  Awareness  (SA).  SA  is  described  as  the  recognition 
and  realization  of  the  performance  of  a  cyber-system.  An  effective  SA  relies  on  the  prediction  of  threats 
both  internal  and  external  to  the  cyber-system  that  does  not  disrupt  the  cyber-systems  ability  to  deliver  a 
set  of  deliverables  that  it  is  expected  to  support.  The  realization  of  SA  on  a  cyber-system  offers  its  set  of 
challenges,  of  which  handling  the  data  deluge  associated  with  such  a  system  is  vital.  Data  management 
and  data  analytics  in  a  cyber-system  is  vital  as  the  confidence  of  predicting  events  tied  down  to  the 
functioning  of  the  cyber-system.  These  predicted  events  are  succinct  to  the  creation  of  a  system  capable 
of  identifying  zero  day  attacks  rather  than  a  reactive  system.  Our  efforts  in  situational  awareness  have 
focused  on  a  generic  distributed  social  network  framework,  which  could  be  extended  to  other  multi¬ 
sensing  based  network  analytics.  The  equipment  allowed  us  to  accelerate  our  efforts  in  simulating 
and  performing  analytics  on  the  social  networks. 

Social  networks  have  transcended  from  a  means  of  casual  communication  to  the  glue  that  connects 
individuals  over  the  virtual  cyber  space.  Social  networks  and  the  associated  social  media  provide  a 
deluge  of  information  that  could  be  exploited  to  enhance  the  SA  of  a  system.  To  this  end,  cybemetics- 
the  science  of  analyzing  human  behavioral  patterns  to  gauge  the  behavior  of  a  community  is  of  growing 
importance.  Our  primary  has  been  on  elaborating  the  theoretical  underpinnings  of  tracking  dynamic 


communities  (groups  of  users)  in  social  networks  using  cybernetics.  Our  efforts  in  spatio-temporal 
associative  learning  are  applicable  here  as  they  provide  correlational  instances  of  user  and  community 
behavior.  We  have  investigated  techniques  employed  to  detect  new  events,  track  events  (with  time), 
and  summarize  events.  The  following  is  a  summary  of  all  the  activities  carried  out  under  this  grant: 
DURIP,  using  the  hardware  acquired  and  tested.  Figure  3  provides  a  conceptual  overview  of  the 
proposed  system  that  supports  analysis  of  large  volumes  of  instructed  data  used  in  this  project. 
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Figure  3.  A  schematic  representation  of  the  proposed  distributed  infrastructure  for  large  scale  data 
management 


(a)  HADOOP  CLUSTER:  Set  up  on  GPGPU’s 

The  main  objective  of  installation  of  HADOOP  on  the  two  GPGPU’s  is  to  leverage  the  GPU’s  compute 
power  in  a  scale  out  architecture.  By  scale  out  we  imply  that  this  infrastructure  will  provide  an  effective 
programming  model  for  large-volume  data  processing.  It  will  also  provide  for  specialized  aggregation 
queries  using  the  Map-Reduce  framework.  This  approach  was  chosen  to  enable  us  to  manage  our  own 
network  connectivity,  communication,  data  placement  and  movement,  load  balancing,  and  most 
important  fault  tolerance.  HADOOP  with  the  HDFS,  provides  for  all  the  above.  The  language  R  is  the 
primary  coding  language  used  to  exploit  this  distributed  data-mining  infrastructure  for  data 
preprocessing  and  data  transformation  routines.  We  ensure  that  the  distributed  map  reduce  functionality 
in  R  is  provided  using  RHIPE  and  distributed  data  mining  routines  are  supported  using  MAHOUT. 

(b)  Big  Data  Storage: 

The  Dell-Power  Edge  R710  storage  server  enables  storage  of  raw  data  (large  files).  This  nine  TB 
storage  acts  as  an  independent  data  management  server  used  to  store  and  backup  raw  data. 

(c)  Representative  Projects  benefiting  from  the  equipment: 


The  following  are  two  representative  projects  that  benefited  from  the  equipment  provided  through  the 
DURIP  project. 

i.  Social  Network  Data  Simulator: 

Simulation  of  network  data  is  critical  to  the  development,  testing,  and  evaluation  of  observed  trends, 
outliers,  and  potential  threats.  Developed  using  C++,  the  social  network  simulator  is  designed  to 
generate  synthetic  user  tagging  behavior  data  that  resembles  real  life  social  tagging  behavior  of  users  in 
a  social  network  (or  human  interactions  in  an  online  network).  Simulating  individual  users’  concepts 
when  reading  articles  generates  the  tags.  The  simulator  then  uses  these  tags  to  generate  social  links,  as 
well  as  taking  into  account  the  network  topography.  The  generic  structure  of  the  simulator  will  allow  us 
to  extend  this  to  other  spatio-temporal  domains. 

ii.  Network  Analysis: 

Higher  order  features  in  a  social  network  are  those  that  encompass  primary  features  like  users  and 
articles  (resources),  which  evolve  with  time.  This  evolution  makes  the  selection  of  higher  order  features 
in  an  evolving  community  a  challenge.  The  topics  of  interest  are  decided  using  the  users  tagging 
behavior  of  different  users  that  change  over  a  period  of  time.  The  challenges  encountered  when  trying 
to  track  changes  in  tagging  behaviors  are  the  reduction  or  termination  of  tag  usage,  an  addition  of  a 
newer  more  relevant  tag  to  the  existing  resource  being  used,  and  analysis  of  communities  with  unequal 
number  of  nodes  present  in  the  network.  This  algorithmic  framework  requires  data  management  across 
the  storage  servers  and  computational  learning  using  the  GPGPU  platform.  Future  efforts  will  entail 
using  time  series  discord  detection  approaches  for  similarity  search  and  data  filtering  schemas. 

3.  Publication(s)  resulted  from  this  support 

[1]  Irakiza,  D.,  Karim,  M.  E.  and  Phoha,  V.  V.,  A  dual  channel  protocol  for  assuring  pseudo¬ 
confidentiality,  Network  and  Distributed  System  Security  Symposium  (NDSS)  2013  (accepted  for 
short  talk). 
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Abstract 

We  introduce  a  non-interactive  dual  channel  authentication 
protocol  and  apply  it  to  long  distance  communication  for 
assuring  pseudo-confidentiality,  a  criteria  that  prevents  a 
malicious  agent  from  exfiltrating  information  to  unautho¬ 
rized  destinations.  Unlike  previously  proposed  protocols 
that  assume  a  manual  (human-aided)  or  equivalent  au¬ 
thenticated  channel,  our  protocol  utilizes  a  non-manual 
authenticated  channel.  We  analyze  the  security  properties 
for  a  possible  realization  of  this  protocol  and  develop 
a  prototype.  Through  a  Raspberry-Pi  implementation, 
we  show  how  the  incorporation  of  the  proposed  scheme 
into  the  future  design  of  keyboard  interfaces  may  impact 
authentication  practices. 

1.  Non-interactive  dual  channel  authentication 
protocols 

Non-interactive  dual  channel  authentication  protocols 
employ  two  channels  and  authenticate  information  r±  re¬ 
ceived  through  one  presumably  insecure  channel  using  a 
piece  of  brief  information  r2  (computed  from  rf)  received 
through  the  other  presumably  authenticated  channel.  To  re¬ 
main  lightweight  these  protocols  employ  one  way  commu¬ 
nication  from  a  message  sender,  Alice,  to  a  message  recipi¬ 
ent,  Bob.  It  can  be  shown  that  in  these  protocols,  imposters 
cannot  authenticate  themselves  if,  (i)  only  one  of  the  chan¬ 
nels  is  compromised  or  (ii)  both  channels  are  compromised 
but  attacks  on  them  are  not  coordinated. 

Existing  literature  describes  a  family  of  non-interactive 
message  authentication  protocols  (e.g.,  [1,  2,  4,  3,  5]), 
known  as  NIMAPs,  that  use  a  manual  (human-aided)  au¬ 
thenticated  channel.  In  these  protocols,  a  hash  value  for 
each  message  being  sent  is  generated.  Alice  then  transmits 

This  work  is  supported  by  AFOSR  Grants  FA9550-09-1-0479  and 
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the  message  and  hash  over  the  insecure  and  authenticated 
channels  respectively  to  Bob.  In  some  protocols,  a  key  is 
applied  to  the  hash  function  when  generating  the  hash  value 
and  this  key  is  sent  with  the  message  over  the  insecure  chan¬ 
nel  [3,  5]  or  with  the  hash  value  over  the  authenticated  chan¬ 
nel  [2], 

NIMAPs  assume  that  the  authenticated  channel  is 
human-aided  hence  adversarial  influence  on  this  channel  is 
significantly  limited.  This  assumption  is  sufficient  for  short 
distance  communications  requiring  infrequent  authentica¬ 
tion  of  messages.  However,  this  limits  their  application  in 
long  distance  communications  (e.g,  over  the  internet)  where 
frequent  authentication  is  required  and  a  human-aided  au¬ 
thenticated  channel  is  unrealistic  and/or  infeasible. 

2.  Proposed  protocol 

Replacement  of  the  human-aided  authenticated  channel 
in  NIMAPs  erodes  the  security  benefit  assured  by  such  a 
channel.  This  exposes  them  (NIMAPs)  to  channel  spoof  at¬ 
tacks  where  an  adversary.  Eve,  could  spoof  the  identity  of 
the  authenticated  channel  when  sending  her  authentication 
messages  to  Bob.  Because  of  the  non-interactive  nature  of 
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Generate  (r)  =  (r\,ri)  such  that; 

-  a*2  can  verify  r\ 
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Figure  1.  Proposed  protocol  (Generic) 


the  protocol.  Bob  has  no  way  to  know  whether  these  authen¬ 
tications  are  sent  by  Alice  hence,  he  will  accept  a  message  if 
its  authentication  is  valid  (i.e.,  it  authenticates  the  message 
sent  through  the  insecure  channel). 

We  address  this  issue  by  introducing  the  following:  (i) 
we  assign  the  task  of  Alice  to  a  small  hardware  attachment 
that  we  assume  not  to  be  affected  by  a  channel  compromise, 
and  (ii)  we  assume  that  it  is  difficult  for  the  adversary.  Eve 
to  compute  r2  given  n . 

Figure  1  presents  the  proposed  generic  protocol.  In  this 
protocol,  Alice  is  a  hardware  attachment  to  the  keyboard 
with  the  following  properties:  (i)  She  has  processing  capa¬ 
bilities  to  parse  typed  keystrokes  and  apply  a  method  gen¬ 
erate  to  identify  n  and  to  compute  r 2  from  rq ;  (ii)  She  only 
receives  input  from  the  keyboard;  (iii)  She  sends  output  to 
Bob  using  two  different  channels  (one  is  insecure  that  runs 
through  the  host  and  the  other  is  authenticated  that  bypasses 
the  host).  Bob  is  a  verifier  program  located  in  a  remote  com¬ 
puter. 

In  Figure  1  (and  in  subsequent  protocol  figures),  the 
added  to  information  received  by  Bob  indicates  that  the  sent 
and  received  values  might  be  different  and  the  insecure  and 
authenticated  channels  are  represented  by  and  “=>”  re¬ 
spectively. 

The  protocol  is  designed  based  on  the  following  assump¬ 
tions:  (i)  an  r 2  can  verify  the  associated  rq;  (ii)  an  adversary 
can  generate  an  rq  or  snoop  the  rq  generated  by  Alice  but 
it  is  difficult  for  her  to  compute  the  associated  r2  from  a 
given  n;  (iii)  Bob  can  compute  r2  from  rq;  (iv)  r[  is  ac¬ 
cepted  only  if  (r[,  r2 )  is  consistent  i.e.,  r2b  (the  expected  r'2 
computed  by  Bob  for  the  r[  he  received)  is  the  same  as  r2 . 

Alice  generates  rq  and  r2  from  the  keystrokes  typed  by 
a  user  and  sends  rq  through  the  possibly  compromised  host 
using  the  insecure  channel  and  r2  directly  to  Bob  using  the 
authenticated  channel.  When  Bob  receives  rq  and  r2  from 
Alice,  he  accepts  r\  only  if  (r[,  r2 )  is  consistent.  To  defeat 
this  protocol  Eve  either  (i)  waits  for  Alice  to  transmit  an  r1 
and  replaces  rq  with  r ieve  and  expects  that  it  will  be  verified 
by  the  associated  r2  (generated  and  transmitted  by  Alice), 
or,  (ii)  estimates  an  r2eve  for  her  rqeve  and  transmits  both  of 
them  to  Bob  pretending  that  the  r2eve  is  transmitted  over  the 
authenticated  channel  (by  spoofing  the  authenticated  chan¬ 
nel’s  identity). 

Existing  NIMAPs  do  not  make  assumption  (ii),  perhaps 
because  the  challenge  involved  with  its  realization.  Instead, 
they  assume  that  Eve  cannot  successfully  use  (or  masquer¬ 
ade  to  be  using)  the  manual  authenticated  channel.  These 
protocols  can  be  defeated  if  the  manual  channel  is  replaced 
by  a  non-manual  physical  channel  since  Eve  then  can  com¬ 
pute  the  right  r2l],::e  for  her  r ieve  and  masquerade  as  Alice. 


3.  A  hash  based  realization  for  assuring 
pseudo-confidentiality 

The  exfiltration  of  information  by  a  malicious  agent  from 
a  compromised  host  to  an  external  adversary  can  be  re¬ 
stricted  by  blocking  all  access  from  the  host  to  unautho¬ 
rized  destinations.  We  refer  to  an  authorized  destination 
as  a  domain  name/IP  that:  (i)  has  been  typed  at  least  once 
by  a  human  user  (establishing  consent/approval),  or  (ii)  has 
been  returned  in  a  response  packet  from  an  authorized  des¬ 
tination.  We  define  pseudo-confidentiality  as  the  security 
property  satisfied  when  an  adversary  cannot  exfiltrate  infor¬ 
mation  to  unauthorized  destinations. 
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Figure  2.  Hash  based  realization 

A  hash  based  realization  of  the  proposed  protocol  assur¬ 
ing  pseudo-confidentiality  is  shown  in  Figure  2.  In  this  real¬ 
ization,  the  generate  function  used  by  Alice  produces  com¬ 
ponents  rv  and  h.  rv  is  the  destination  (IP/domain  name)  of 
a  request  being  made  and  h  =  Hx(rv)  is  the  hashed  value 
of  the  requested  destination.  This  realization  works  in  two 
phases  as  described  below: 

(i)  Initialization  phase 

This  is  a  one  step  setup  process  required  prior  to  the  estab¬ 
lishment  of  the  rest  of  the  protocol.  Dining  this  phase,  a 
human  operator  (i)  informs  Bob  of  the  functions  H  and  H 
used  by  Alice  and  (ii)  initializes  a  K  for  Alice  and  Bob. 
During  the  protocol  phase,  function  H  uses  K  as  a  parame¬ 
ter  to  produce  h  and  function  //  is  used  to  update  I\  by  both 
Alice  and  Bob.  Due  to  some  failure  or  interruption  in  com¬ 
munication,  if  Alice  and  Bob  do  not  have  the  same  updated 
K,  all  destination  requests  will  be  rejected  and  the  value  of 
K  will  have  to  be  re-initialized. 

(ii)  Protocol  phase 

In  this  phase,  components  rv  and  h  are  transmitted  from 
Alice  to  Bob  via  the  insecure  and  authenticated  channels 


respectively.  When  Bob  receives  both  r'v  and  h! ,  he  com¬ 
putes  his  own  h  using  his  K  i.e.,  h  =  Hk{t'v)  and  considers 
(r'v,  h')  to  be  consistent  if  the  received  h!  =  h.  The  value  of 
K  is  updated  by  both  Alice  and  Bob  using  another  function 
H.  Alice  updates  her  K  after  sending  h  to  Bob  and  Bob 
updates  his  K  every  time  he  receives  an  h!  ^  0  from  Alice 
and  accepts  r'v .  H(K)  transforms  the  value  of  I\  based  on 
the  current  value  of  I\  to  produce  a  new  K. 

During  experimentation  we  relax  the  definition  of  con¬ 
sistent  by  accepting  requests  to  destinations  that  are  (i)  non- 
typed  but  had  been  authorized  before  and  (ii)  received  in 
response  to  requests  to  authorized  destinations,  to  improve 
the  usability  of  the  solution. 

Security  Analysis 

Bob  accepts  r'v  if  (r'v,  h)  is  consistent  for  the  current  K 
(that  Bob  has  knowledge  of).  Bob  can  be  made  to  accept  a 
request  from  Eve  in  the  following  cases: 

Case  1 :  Channel  spoof  attack 

A  malicious  agent.  Eve,  residing  in  the  host,  can  no 
longer  launch  a  channel  spoof  attack  simply  by  sending  her 
own  rv  and  h  to  Bob  if  the  K  she  uses  for  computing  h  (at  a 
certain  instance)  is  not  what  Bob  is  expecting  Alice  to  use. 
Since  our  assumption  is  that  Eve  has  no  knowledge  of  K, 
Alice  can  only  expect  that  the  K  she  is  using  matches  with 
the  K  Bob  is  expecting.  For  an  n-bit  I\ ,  the  probability  of 
such  a  match  is  2~n. 

Case  2:  Second  preimage  Attack 

Eve  can  replace  the  destination  rv  generated  by  Alice 
with  her  own  destination  reve  and  expect  that  there  exists  a 
K"  such  that  H(rv,K )  =  H(reve,I\").  A  second  preim¬ 
age  resistant  hash  function  is  relevant  in  our  case  because 
for  a  second  preimage  resistant  hash  function,  given  an  in¬ 
put  rv,  it  should  be  difficult  for  an  adversary  to  find  an¬ 
other  input  reve  ^  rv  such  that  H(reve)  =  H(rv).  If  H 
is  es  second  preimage  resistant  (i.e.,  es  is  the  probability 
of  H(reve)  =  H(rv )  occurring),  then  the  probability  of  a 
successful  attack  is  es. 


4.  Results 


We  setup  a  prototype  as  shown  in  Figure  3  using 
Raspberry-Pis  (model  B)  as  the  hardware  attachments  to  the 
keyboard  and  collected  data  for  over  four  months.  As  men¬ 
tioned  in  section  3,  our  experimentation  considers  a  relaxed 
definition  of  consistent  as  reflected  in  the  logic  for  Pseudo 
Confidentiality  Verifier  in  Figure  3.  Figure  4  shows  typical 
results  for  a  representative  seven  day  period.  As  expected, 
all  illegitimate  requests  were  denied  (requests  not  initiated 
by  the  user).  Some  legitimate  requests  were  also  denied 
because  associated  destinations  were  not  typed  once  before 
their  uses.  Figure  4  (b)  shows  the  breakdown  of  the  legiti¬ 
mate  typed  and  un- typed  requests. 


Figure  3.  Raspberry-Pi  based  prototype 


Figure  4.  %  of  requests  accepted  per  day 


References 

[1]  D.  Balfanz,  D.  K.  Smetters,  P.  Stewart,  and  H.  C.  Wong.  Talk¬ 
ing  to  strangers:  Authentication  in  ad-hoc  wireless  networks. 
In  NDSS,  2002. 

[2]  C.  Gehrmann,  C.  J.  Mitchell,  and  K.  Nyberg.  Manual  authen¬ 
tication  for  wireless  devices.  RSA  Cryptobytes,  7(  1):29 — 37, 
Spring  2004. 

[3]  A.  Mashatan  and  D.  R.  Stinson.  Noninteractive  two-channel 
message  authentication  based  on  hybrid-collision  resistant 
hash  functions.  IACR  Cryptology  eP rint  Archive,  2006:302, 
2006. 

[4]  S.  Pasini  and  S.  Vaudenay.  An  optimal  non-interactive  mes¬ 
sage  authentication  protocol.  In  Proceedings  of  the  2006  The 
Cryptographers’  Track  at  the  RSA  conference  on  Topics  in 
Cryptology,  CT-RSA’06,  pages  280-294,  Berlin,  Heidelberg, 
2006.  Springer- Verlag. 

[5]  M.  R.  Reyhanitabar,  S.  Wang,  and  R.  Safavi-Naini.  Non¬ 
interactive  manual  channel  message  authentication  based  on 
etcr  hash  functions.  In  Proceedings  of  the  12th  Australasian 
conference  on  Information  security  and  privacy,  ACISP’07, 
pages  385-399,  Berlin,  Heidelberg,  2007.  Springer- Verlag. 


Final  Report  of  Tennessee  State  University  Equipment  Subcontract 
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PI:  Sachin  Shetty 

The  AFOSR  equipment  grant  was  instrumental  in  supporting  research  projects  in  IP  geolocation 
and  cloud  carrier  security.  The  equipment  will  also  support  newly  funded  research  projects  on 
embedded  systems  security  and  smartphone  security.  In  addition,  undergraduate  and  graduate 
students  also  leveraged  the  equipment,  to  implement  their  respective  senior  design  projects  and 
theses.  Following  is  an  overview  of  how  the  equipment  affected  current  projects,  new  projects, 
student  projects  and  theses,  future  impacts  and  AFRL  collaboration 

1.  Current  Projects 

IP  Geolocation:  We  developed  a  classification  based  method  to  geolocate  nodes  on  the  internet 
with  higher  reliability,  robustness  and  sensitivity  than  the  current  state-of-the  art  measurement- 
based  IP  geolocation  techniques.  The  equipment  grant  allowed  us  to  purchase  two  Dell 
PowerEdge  M  520  servers  (quad  core)  for  training  and  testing  the  machine  learning  classifier. 
The  presence  of  large  number  of  measurements  in  our  experiment  necessitated  the  need  for 
developing  a  parallel  implementation  of  our  classifier  on  quad  core  servers.  The  parallel 
implementation  provided  a  faster  speedup  and  we  were  able  to  publish  results  of  our  work  in 
Milcom  conference.  Recently,  we  extended  the  work  to  perform  visualization  of  IP  geolocation 
results  on  Google  maps.  For  this  work,  we  extensively  used  the  quad  core  servers  to  implement 
the  complete  software  system.  A  paper  based  on  this  work  has  been  submitted  to  a  conference. 

1)  Hellen  Maziku,  Sachin  Shetty,  Keesook  Han,  Tamara  Rogers.  Enhancing  the 
Classification  Accuracy  of  IP  Geolocation.  Proc.  of  Military  Communications 
Conference,  Orlando,  Florida,  Nov  2012. 

2)  Asmah  Muallem  and  Sachin  Shetty.  Visualization  of  IP  Geolocation  of  Spam  Email, 
submitted  to  IEEE  ComCAP  2013 

Cloud  Carrier  Security:  In  this  research  effort,  we  developed  a  risk  assessment  framework  to 
investigate  the  security  risk  of  the  cloud  carrier  between  cloud  users  and  two  commercial  cloud 
providers.  The  risk  assessment  framework  examined  the  security  vulnerabilities  of  operating 
systems  of  routers  within  the  cloud  carrier.  The  framework  provides  the  quantifiable  security 
metrics  of  each  cloud  carrier  which  enables  users  to  assess  the  security  services  provided  by 
cloud  providers.  We  used  a  Markov  chain  for  our  framework  and  the  operating  system 
vulnerability  information  was  collected  from  hundreds  of  routers  between  100  PlanetLab  nodes 
and  datacenters  for  two  cloud  providers.  We  used  the  equipment  grant  to  purchase  two  quad  core 
workstations  for  evaluating  the  Markov  chain  framework.  We  have  submitted  the  results  of  this 
research  effort  to  one  conference. 


1)  Swetha  Lenkala,  Sachin  Shetty  and  Kaiqi  Xiong.  Security  Risk  Assessment  of  Cloud 
Carrier,  submitted  to  ACM  CCGRID  2013 

2.  New  Projects 

The  AFOSR  equipment  grant  provided  the  leverage  to  attract  two  funded  projects  from  Boeing 
and  AFRL.  The  Boeing  research  project  involves  the  modeling  of  wireless  attacks  on  a  mobile 
cyber  physical  system.  An  experimental  testbed  comprising  of  10  mobile  robots  has  been 
developed  with  normal  and  attack  traffic.  We  are  currently  collecting  RF  signals  which  will  be 
later  used  for  modeling  the  attack  behavior.  The  servers  and  workstations  will  be  utilized  to 
evaluate  the  attack  model.  The  AFRL  project  focuses  on  the  cloud  based  detection  of  malicious 
websites  on  smartphones.  The  nodes  purchased  from  the  equipment  grant  will  be  included  in  the 
development  of  a  small  cloud  testbed  to  evaluate  the  detection  mechanism. 

1)  "Development  of  an  agile  and  resilient  security  architecture  to  protect  embedded  systems 
from  emerging  cyber  threats",  Boeing  contract,  11/1/2012  -  10/31/2013. 

2)  "Cloud  computing  based  detection  and  response  system  to  combat  malware  infecting  web 
browsers  on  smartphones”,  AFRL  MLP  contract.  12/1/2012  -  11/30/2013. 

3.  Student  Projects/theses 

Two  undergraduate  students  used  the  equipment  grant  to  complete  their  capstone  senior  design 
projects  on  cloud  carrier  security  and  insider  threat  detection.  Two  Master  students  completed 
their  theses  with  the  help  of  the  resources  provided  by  the  equipment  grant.  Currently,  5 
undergraduate  and  5  graduate  students  are  directly  and  indirectly  benefitting  from  the  resources 
provided  by  the  equipment  grant. 

4.  Future  Impacts 

The  equipment  grant  will  be  very  beneficial  to  support  our  current  research  projects  and  will  be 
featured  in  our  facilities  description  as  we  submit  future  proposals  for  funding.  The  equipment 
grant  has  played  a  crucial  role  in  the  development  of  the  TSU  cloud. 

5.  AFRL  Collaboration 

The  resources  from  the  equipment  grant  are  currently  leveraged  to  implement  the  project 
activities  outlined  in  an  AFRL  MLP  contract  on  smartphone  security.  One  of  the  requirements 
of  the  project  is  the  presence  of  a  cloud  environment.  The  equipment  grant  has  helped  us  develop 
the  cloud  platform  at  TSU  which  will  provide  an  excellent  testbed  to  evaluate  the  project 


activities  outlined  in  the  AFRL  MLP  contract.  We  are  confident  that  the  sustainability  of  the 
equipment  will  be  maintained  by  the  AFRL  MLP  contract  and  similar  grants/contract  from 
AFRL  and  other  funding  agencies. 
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Visualizing  Geolocation  of  Spam  Email 


Asmah  Muallem,  Sachin  Shetty,  and  S.  K.  Hargrove 


Abstract —  Viruses  and  phishing  scams,  as  a  result  of 
spam,  are  increasingly  becoming  numerous.  Spontaneous 
methods  used  by  spammers  present  a  threat  in  spam 
prevention.  Tools  for  spam  identification  and  prevention 
are  increasing  but  lack  presentation  fundamentals.  Online 
databases  help  determine  the  locations  of  spammers  and 
provide  more  information  using  an  IP  address.  The 
integration  of  components  used  in  this  process  play  an 
important  role  in  how  users  use  this  information.  A 
primary  concern  is  lack  of  visualization  tools  to  effectively 
analyze  the  information.  In  this  paper,  we  develop  a 
visualization  tool  to  represent  the  geographical  locations  of 
spammers  based  on  the  integration  of  MaxMind  and 
WhoIS  databases  along  with  the  Google  Maps  API.  The 
visualization  tool  provides  a  visualization  of  spam 
origination  along  with  patterns  of  spammers  identified 
from  spam  activity.  A  key  component  in  the  development 
of  this  tool  is  its  extensible  framework  allowing  for  the 
addition  of  resources  to  retrieve  more  information  about  a 
spammer  and  analyze  additional  patterns  of  spammers  for 
spam  analysis. 

Index  Terms —  Network  Visualization,  Security  Visualization, 
Spam  Analysis,  IP  Geolocation. 


I.  Introduction 

Email  has  been  gaining  popularity  since  the  early  1990’s, 
and  has  become  the  standard  for  electronic  communication. 
Email  has  become  the  primary  source  of  communication  due  to 
its  storage,  organization,  and  flexibility.  With  social  networks 
such  as  Facebook  and  Twitter  providing  electronic 
communication,  email  still  remains  the  most  popular  online 
activity.  Most  companies  ban  the  use  of  social  networks  due  to 
its  flexibility  of  distributing  both  professional  and  non¬ 
professional  material  and  information.  Advertisers  facilitate 
email  as  a  means  of  inexpensively  spreading  advertisement 
which  sometimes  lead  to  excessive  spam  in  a  user’s  mailbox. 
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Spam  fills  inboxes  with  useless  emails  and  exposes 
Advertisement  spam  usuallyviruses  and  phishing  scams  if  the 
advertisers  webmail  service  or  ISP  is  hacked.  Filtering  spam  as 
a  result  of  this  problem  becomes  an  email  provider’s  worst 
nightmare.  Providing  an  efficient  spam  filter  helps  detect  new 
spam  and  remove  it  from  the  user’s  inbox  but  does  not  prevent 
it.  Some  problems  of  spam  are  network  traffic  delays,  storage, 
and  the  serious  security  threats  it  poses.  The  techniques 
created  to  detect  and  prevent  spam  focus  on  analyzing  patterns 
in  datasets  of  email  from  various  users  for  a  given  location.  A 
concern  for  users  is  the  privacy  of  emails,  which  requires 
current  tools  to  parse  and  filter  sensitive  data  from  the  emails 
prior  to  analyzing.  The  parsing  of  disparate  and  massive 
datasets  to  implement  a  visual  analytics  framework  incurs 
additional  overhead.  In  this  paper,  we  deploy  IP  Geolocation 
as  a  tool  to  provide  a  user  friendly  interface  for  visualizing  the 
geographical  locations  of  spam  attackers  for  a  particular  email 
account  on  a  Google  map  with  analysis  capabilities  and  an 
extensible  framework  for  additional  resources  to  identify 
patterns  in  spam  attacks. 

Spam  is  a  common  type  of  cyber  nuisance  that  not  only  wastes 
resources  but  also  poses  security  threats  [1],  Determining  the 
geographic  location  of  an  Internet  host  is  valuable  for  a 
number  of  Internet  applications.  [11].  IP  Geolocation  has 
found  applications  in  cloud  security,  especially  in  the  ability  to 
accurately  determine  the  location  of  data  in  a  cloud  computing 
environment.  The  use  of  IP  Geolocation  benefits  network 
analysts,  police  authority,  and  ordinary  users  because  of  its 
ability  to  address  different  avenues  in  network  security. 

Recently,  there  have  been  some  efforts  in  the  area  of  visual 
analytics  of  emails.  Analyzing  and  managing  one’s  mailbox  or 
archive  have  been  implemented  through  visualizations  [7,  10]. 
Visualization  techniques  designed  to  show  patterns  in 
incoming  email  which  can  reveal  misidentified  pieces  of  spam, 
common  spam  sources,  and  patterns  such  as  periods  of 
increased  spam  activity  have  also  been  implemented  [2]. 
Analytical  models  that  use  coefficient  vectors,  ‘density’  and 
‘input’,  with  visual  clustering  methods  to  classify  and  display 
the  spam  emails  have  been  introduced  for  the  analysis  of  spam 
email  viruses  and  attacks  [1],  We  introduce  an  approach 
extending  previous  tools  to  identify  the  source  of  spam  based 
not  only  on  email  information  but  ISP  provider  and  region 
information.  In  this  paper  we  present  the  design  and 
implementation  of  a  tool  utilizing  the  techniques  of  IP 
Geolocation  to  visualize  the  geographical  locations  of  spam 
attackers  using  a  Google  map  with  analysis  controls  to  identify 
patterns  of  spam  attackers. 
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The  organization  of  the  paper  is  as  follows.  We  present  the 
system  model  in  Section  II.  Our  approach  in  detail  is  presented 
in  Section  III.  In  the  following  section  we  describe  the 
implementation  of  our  visualization  system  prototype.  We 
present  the  results  demonstrating  the  evaluation  of  the 
application  in  Section  V.  We  conclude  with  future  work  in 
Section  VI. 


II.  SYSTEM  MODEL 

Our  model  is  based  on  visualizing  the  geographical 
location  of  spam  attackers  using  a  web  application.  A  primary 
element  in  preventing  spam  is  investigating  the  source.  Our 
system  model  (Fig.  1)  comprises  of  data  acquisition,  database 
design,  and  visualization  phases.  We  process  spam  information 
from  email  accounts  through  the  web  application  and  generate 
a  visualization  of  the  geographical  locations  of  the  spam 
attackers. 

The  first  phase  is  data  acquisition  and  is  part  of  the  spam 
generation  process.  We  generate  spam  email  using  a  test  email 
account  and  gather  the  information  when  the  user  logs  into  the 
email  account  through  the  web  application.  We  use  a  MySQL 
database  to  store  the  emails.  We  integrate  into  our  database 
MaxMind  and  WhoIS  information  (geolocation,  ISP,  etc) 
corresponding  to  the  spam  emails  gathered.  The  last  and  most 
important  phase  is  visualization.  Finally,  in  the  visualization 
phase,  relevant  data  associated  with  the  email  account  is 
retrieved  to  display  the  geographical  locations  of  the  spam 
attackers  along  with  analysis  controls  on  a  Google  map.  The 
visualization  analysis  controls  help  determine  patterns  in  spam 
attacks  (for  ex.,  frequency  of  attacks  from  a  given  location  or 
ISP  over  a  period  of  time)  or  provide  additional  information  to 
perform  further  analysis  and  help  prevent  future  spam  attacks. 


Fig.  1.  System  model 

III.  APPROACH 

Our  objective  is  to  design  a  visualization  tool  to  determine 
the  geographical  locations  of  spam  attackers  for  analysis 
purposes  and  spam  prevention.  Our  system  model  presents  a 
novel  approach  for  preventing  spam  by  integrating  databases 
and  visualization  tools  to  come  up  with  a  prototype  for 
visualizing  the  location  of  spam  attackers.  Through  this  model 
we  can  detect  spam  attackers  which  use  different  email 
addresses  associated  with  the  same  domain  to  send  spam 
emails. 


Our  approach  provides  detailed  information  about  the  spam 
attacker  along  with  the  geographical  location  to  identify  the 
main  source  of  spam  and  give  further  analysis  to  help 
determine  a  pattern.  The  current  web  tools  use  IP  addresses  to 
query  databases  for  information  about  spam  attackers  but  do 
not  have  the  capability  of  obtaining  all  information  at  one  time 
or  from  one  place.  This  process  involves  using  multiple 
websites  to  retrieve  the  IP  address  of  the  spam  attacker  and  the 
information  about  the  spam  attacker.  This  method  can  cause 
discrepancies  since  different  databases  and  websites  are  used 
in  the  process.  Our  approach  enhances  this  capability  by 
offering  one  tool  to  implement  all  of  these  actions  along  with 
providing  a  user  friendly  visualization  and  analysis  controls  to 
gather  more  information  about  the  spam  attacker. 

A.  Data  Acquisition 

We  create  a  test  account  in  Gmail  to  initiate  the  process  of 
retrieving  spam.  We  believe  Gmail  will  serve  as  the  best  email 
provider  for  our  research  since  it  is  associated  with  several 
online  services  such  as  YouTube.  YouTube,  blogs,  and  online 
discussion  groups  can  lead  to  excessive  spam  emails  in  a 
user’s  mailbox.  Gmail’s  spam  filter  is  decent  but  does  not 
always  automatically  mark  spam  emails  as  spam  and  these 
emails  can  still  be  delivered  to  the  user’s  mailbox  if  they  have 
different  email  address  but  the  same  domain  name.  We  select 
outlook  as  the  second  email  provider  due  to  its  flexibility  in 
allowing  several  different  email  accounts.  Gmail’s  spam  folder 
is  equivalent  to  Outlooks’  Junk  E-mail  folder.  We  create 
Hotmail  and  Gmail  email  accounts  to  sign  up  with  online 
services  such  as  YouTube  and  join  several  online  and  Google 
discussion  groups.  We  generate  spam  a  few  months  prior  to 
the  implementation  of  the  tool.  During  the  first  month,  Gmail’s 
test  account  accumulated  approximately  200  or  more  spam 
email  and  Hotmail’s  test  account  accumulated  approximately 
1 00  or  more  j unk  mail. 

B.  Database  Design 

We  use  MySQL  to  store  and  manipulate  data.  MySQL  is  an 
open  source  relational  database  management  system  and  can 
be  integrated  with  programming  languages  in  Visual  Studio 
and  Java.  We  store  MaxMind  data  prior  to  storing  spam  email 
information.  MaxMind’s  IP  Geolocation  data  and  services 
provide  Geo  IP  which  consists  of  country,  region,  postal  code, 
city,  area  code,  and  latitude/longitude  coordinate  information 
for  a  given  IP  address.  Geo  IP  data  are  composed  of  user- 
centered  location  data  gathered  through  online  sites  that  store 
web  visitor’s  geographic  locations.  Millions  of  datasets  run 
through  algorithms  to  identify,  extract,  and  produce  location 
points  for  IP  addresses. 

MaxMind  GeoIP  offers  paid  and  free  services  and  some  of 
the  free  services  include  downloadable  files  of  Geo  IP  data  in 
csv  and  binary  formats  and  APIs  for  retrieving  and 
manipulating  Geo  IP  data.  Our  database  imports  downloaded 
GeoLite  City  Geo  IP  data  files.  GeoLite  City  contains  two  csv 
files:  blocks,  and  location.  The  location  file  includes  country. 
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city,  state,  region,  postal  code,  area  code,  and 
latitude/longitude  information.  The  blocks  fde  includes  IP 
addresses  in  long  format.  A  column  called  location  is  included 
in  both  fdes  to  match  the  IP  address  range  to  a  location.  Our 
database  also  integrates  WholS  information.  WhoIS  queries 
databases  that  store  registered  users  or  assignees  of  an  Internet 
source.  The  query  is  initiated  using  telnet  and  a  successful 
acknowledgement  returns  the  ISP  and  domain  registered 
information  for  a  given  IP  address. 

We  use  ASP. Net  and  C#  along  with  ADO. Net  to  connect  to 
MySQL  to  import  downloaded  MaxMind  GeoIP  data  and  store 
spam  email  information  into  the  MySQL  database.  We  use 
Koolwired,  an  open  source  DLL  of  a  C#  implementation  of 
IMAP,  to  connect  to  Gmail’s  IMAP  configuration  through  C# 
and  the  Microsoft.Office.  Interop.  Outlook  library,  which 
provides  support  for  interoperability  between  the  COM  object 
model  of  Microsoft  2010  and  managed  applications  that 
automate  Outlook,  to  connect  to  Outlook.  We  parse  and  gather 
spam  from  the  user’s  Junk  E-mail/spam  folder  through  the  C# 
application  using  the  user’s  login  credentials  after  importing 
MaxMind  Geo  IP  data.  We  store  sender,  receiver,  subject,  and 
date/time  information.  We  also  store  additional  information 
such  as  transport  headers  from  Outlook  emails.  Transport 
headers  are  useful  for  further  analysis  to  identify  a  pattern  in 
the  encoding  of  spam  emails.  We  determine  the  geographical 
location  coordinates  of  the  spam  attacker  during  the  process  of 
storing  spam  email  information  into  our  database. 
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Fig.  2.  MySQL  database  design 

We  use  a  MySQL  Routine,  insertEmailCoordinates,  to 
identify  the  location  coordinates  associated  with  the  IP  address 
range  in  MaxMind  for  the  spam  attacker’s  IP  address.  We  use 
the  C#  method,  Dns.GetHostAddresses  (),  to  translate  the 
parsed  email  domain  name  into  an  IP  address.  We  convert  the 
IP  address  to  long  format  to  allow  for  the  capability  of  passing 
the  IP  address  to  insertEmailCoordinates  for  retrieving 
location  coordinates.  We  obtain  WhoIS  information  about  the 


spam  attacker  using  the  translated  IP  address  in  the  last  step  of 
the  database  design  phase.  We  gather  information  from 
WhoIS  using  a  telnet  session  initiated  through  the  C#  web 
application  and  query  the  WhoIS  database  with  the  translated 
IP  address  as  a  parameter.  The  WhoIS  information  returns  in 
different  formats  based  on  the  IP  address,  in  which  we  address 
by  designing  an  algorithm  to  handle  all  formats  and  prevent 
errors.  The  tables  within  our  MySQL  database  include 
Allocation,  tblblocks,  tblusers,  tblwhoisinformation, 
tblemailinformation,  and  tblemailcoordinates  which  are 
illustrated  in  Fig.  2. 

C.  Visualization 

Our  tool  can  help  detect  patterns  of  spam  attackers  through 
our  analysis  and  visualization  controls.  We  implement  the 
visualization  of  geographical  locations  for  spam  attackers 
using  the  C#  web  application  responsible  for  the  process 
discussed  in  the  database  design  phase.  We  provide 
visualization  by  manipulating  collected  data  and  using  the 
Google  Maps  API.  The  main  source  for  the  visualization  is 
based  on  the  geographical  location  coordinates  determined 
from  insertEmailCoordinates  and  WhoIS  information  stored  in 
our  database  during  the  database  design  phase.  We  connect  to 
the  database  using  ADO. Net,  retrieve  the  WhoIS  information 
and  geographical  coordinates  and  store  them  in  a  C#  data 
structure.  The  JavaScript  functions  in  ASP.net  parse  the 
information  from  the  data  structure  and  send  it  to  the  Google 
Maps  API  functions.  We  design  JavaScript  functions  in  the 
ASP. Net  code  to  generate  multiple  markers  on  the  Google  map 
identifying  locations  of  different  spam  attackers  with  info 
windows  for  each  marker  providing  detailed  information  about 
each  spam  attacker. 

Currently,  we  are  only  displaying  WhoIS  information  related 
to  the  IP  address  in  the  info  window.  We  are  in  the  process  of 
providing  more  detailed  information  about  the  spam  attacker, 
and  other  map  control  options  for  different  views  of  the  map. 
A  single  marker  is  displayed  on  the  Google  map  to  represent 
multiple  spam  attackers  from  the  same  location,  preventing  the 
duplication  and  excessiveness  of  markers  on  the  Google  map. 
Our  research  uses  latitude  and  longitude  coordinates  to  display 
the  locations  on  the  Google  map  versus  using  regions. 
Currently,  our  tool  provides  filtering  of  geographical  locations 
of  spam  attackers  based  on  domain  name,  ISP,  and  the  region 
associated  with  the  most  spam.  We  are  in  the  process  of 
providing  more  tools  in  the  fdtering  panel  for  temporal 
analysis  and  fdtering  spam  attacker  locations  by  date,  country, 
city,  state,  and  email  or  IP  address.  We  use  different  color 
coding  schemes  to  represent  regions  with  the  most  spam 
emails.  We  will  extend  the  map  options  for  different 
visualization  features  to  include  the  map  options  available  in 
the  Google  Maps  API.  We  can  help  identify  patterns  of  spam 
attackers  through  the  analysis  controls  within  this  tool.  We 
illustrate  the  visualization  components  in  Fig.  3. 
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Fig.  3.  Visualization  components  of  tool. 


IV.  IMPLEMENTATION 

A  prototype  of  our  implementation  uses  test  accounts 
created  in  Gmail  and  Hotmail.  We  import  MaxMind  Geo  IP 
data  into  the  database  using  the  process  discussed  in  the 
database  design  phase.  We  select  the  Gmail  test  account 
during  the  initial  phase  of  implementation  since  it  contains 
more  spam  emails.  We  log  into  the  test  email  accounts 
through  a  login  interface  within  the  C#  web  application  to 
initiate  the  process  discussed  in  the  database  design  phase 
for  storing  spam  email  and  WholS  information,  and 
identifying  geographical  location  coordinates.  During 
implementation,  we  found  discrepancies  between  MaxMind 
and  WholS  for  a  small  percentage  of  IP  addresses  when 
geographical  location  coordinates  were  returned  from 
MaxMind.  We  implemented  further  analysis  on  these 
discrepancies  using  the  MySQL  Workbench  GUI  and  found 
false  positives  because  the  location  of  the  IP  address 
returned  from  MaxMind  did  not  match  the  location 
registered  with  WholS.  We  resolve  this  issue  by  identifying 
multiple  addresses  in  WholS  for  the  IP  address  to  give  one 
address  which  will  match  the  location  returned  from 
MaxMind.  We  also  found  locations  in  MaxMind  which  did 
not  have  an  address  in  WholS. 


V.  RESULTS 

We  visualize  geographical  locations  of  spam  attackers 
after  viewing  the  list  of  collected  spam  email  information.  The 
visualization  contains  multiple  markers  on  the  Google  map 
representing  the  geographical  locations  of  spam  attackers 
along  with  info  windows  providing  WholS  information 
invoked  when  clicking  the  marker.  The  Visualization  produced 
3  markers  on  the  Google  map  to  represent  the  200  spam  emails 
in  the  Gmail  test  account.  The  markers  represented  were  based 
off  of  spam  attackers  with  different  IP  addresses  but  from  the 
same  geographical  location  preventing  duplicate  markers  on 
the  map.  We  fdter  by  the  Gmail  domain  returning  2  markers 
representing  geographical  locations  of  spam  attackers  using 
Gmail  accounts.  We  also  filter  by  the  klhh333oextinction.com 
domain,  returning  1  marker  representing  a  spam  attacker 
located  in  Arizona.  The  ISP  filter  is  used  to  filter  spam 
attacker  locations  based  on  the  registered  ISP  in  WholS.  The 
last  step  in  our  evaluation  involves  checking  the  option  to 
highlight  the  region  with  the  most  spam  email,  which  is  CA. 

Logging  into  the  Gmail  test  email  account,  storing  email 
information  into  the  database,  and  displaying  associated  email 
information  is  displayed  in  Fig.  4.  Visualizing  the  Geolocation 
of  Spam  attackers  for  the  test  email  account  on  the  Google 
map  is  displayed  in  Fig.  5.  Zooming  into  a  desired 
geographical  location  on  the  Google  map  is  shown  in  Fig.  6. 
Filtering  by  the  Gmail.com  domain  to  visualize  spam  attackers 
using  Gmail.com  and  viewing  WholS  information  related  to 
the  spam  attacker  is  shown  in  Fig.  7.  Filtering  by  the 
klhh333oextinction.com  domain  to  visualize  spam  attackers 
using  klhh333oextinction.com  and  viewing  WholS 
information  related  to  the  spam  attacker  is  shown  in  Fig.  8. 
The  last  figure  displays  checking  the  top  ISP  region  checkbox 
to  highlight  the  region  with  the  most  spam. 
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Our  analysis  determined  that  this  was  due  to  lack  of 
fresh  updates  from  MaxMind  since  latest  updates  are  only 
available  monthly.  We  resolve  this  issue  by  comparing  the 
registered  date  for  the  IP  address  from  WholS  against  the 
date  of  when  the  latest  MaxMind  files  were  imported,  the 
location  which  has  the  most  current  date  is  selected.  Our 
next  step  in  implementation  is  to  visualize  the  geographical 
location  of  spam  attackers  from  the  spam  email  information 
gathered  after  logging  into  the  Gmail  account  from  the  C# 
web  application.  The  C#  web  application  provides  an  option 
to  visualize  geographical  locations  for  spam  attackers 
associated  with  the  given  email  account  and  to  view  stored 
spam  email  information.  We  are  currently  using  an  ASP. Net 
text  box  to  view  stored  spam  email  information  for  test 
purposes,  but  will  add  an  ASP. Net  Gridview  for  a  more 
organized  view. 
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Fig.  4.  Logging  into  the  Gmail  test  account,  processing  spam 
email  information  and  displaying  stored  information. 
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Fig.  5.  Visualizing  geolocation  of  spam  email. 


Fig.  7.  Filtering  by  the  Gmail.com  domain  and  visualizing 
geolocation  of  spam  email  associated  with  a  Gmail  account 


along  with  viewing  the  WhoIS  information  related  to  the 
account. 
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Fig.  8.  Filtering  by  the  klhh333oextinction.com  domain  and 
visualizing  geolocation  of  spam  email  associated  with  a 
klhh333oextinction  account  while  using  the  satellite  map 
option. 


Fig.  9.  Visualizing  geolocation  of  spam  email  for  the  Gmail 
test  account  with  the  region  with  the  most  spam  highlighted. 


VI.  Conclusion 

In  this  paper  we  present  a  visualization  environment  to 
visualize  the  geographical  locations  of  spam  attackers  on  a 
Google  map  using  IP  Geolocation,  the  integration  of  MaxMind 
and  WhoIS  databases,  and  the  Google  Maps  API.  Our  results 
validate  the  effectiveness  of  integrating  multiple  systems  to 
develop  an  application  for  preventing  spam.  Our  analysis 
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controls  within  this  visualization  tool  can  help  network 
analysts  and  other  users  identify  patterns  of  spam  attackers. 
Our  future  work  focuses  on  developing  a  visualization 
environment  to  visualize  geographical  locations  of  spam 
attackers  based  on  analysis  of  spam  emails  in  a  cloud 
computing  environment. 
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Abstract — Traditionally,  scheduling  policies  have  been  opti¬ 
mized  to  perform  well  on  metrics  such  as  throughput,  delay 
and  fairness.  In  the  context  of  shared  event  schedulers,  where  a 
common  processor  is  shared  among  multiple  users,  one  also  has  to 
consider  the  privacy  offered  by  the  scheduling  policy.  The  privacy 
offered  by  a  scheduling  policy  measures  how  much  information 
about  the  usage  pattern  of  one  user  of  the  system  can  be  learnt 
by  another  as  a  consequence  of  sharing  the  scheduler.  In  [1], 
we  introduced  an  estimation  error  based  metric  to  quantify  this 
privacy.  We  showed  that  the  most  commonly  deployed  scheduling 
policy,  the  first-come-first-served  (FCFS)  offers  very  little  privacy 
to  its  users.  We  also  proposed  a  parametric  non-work-conserving 
policy  which  traded  off  delay  for  improved  privacy.  In  this  work, 
we  ask  the  question,  is  a  trade-off  between  delay  and  privacy 
fundamental  to  the  design  to  scheduling  policies?  In  particular,  is 
there  a  work-conserving,  possibly  randomized,  scheduling  policy 
that  scores  high  on  the  privacy  metric?  Answering  the  first 
question,  we  show  that  there  does  exist  a  fundamental  limit 
on  the  privacy  performance  of  a  work-conserving  scheduling 
policy.  We  quantify  this  limit.  Furthermore,  answering  the  second 
question,  we  demonstrate  that  the  round-robin  scheduling  policy 
(a  deterministic  policy)  is  privacy  optimal  within  the  class  of 
work-conserving  policies. 

I.  Introduction 

In  multi-tasking  systems  where  a  finite  resource  is  to  be 
shared,  a  scheduler  dictates  how  the  resource  is  divided 
among  competing  processes.  Examples  of  systems  which 
have  schedulers  include,  a  computer  where  the  CPU  needs 
to  be  shared  between  the  different  threads  running,  a  cloud 
computing  infrastructure  with  shared  computing  resources,  a 
network  router  serving  packets  from  different  streams  etc.. 
Some  of  the  commonly  used  schedulers  are  first-come-first- 
served  (FCFS),  round-robin  (RR),  shortest-job-first  (SJF)  and 
priority  schedulers.  Performance  of  a  scheduler  is  measured 
in  one  of  several  metrics  including,  throughput  (number  of 
job  completions  per  unit  time),  average  delay  (the  difference 
between  the  job  completion  time  and  the  job  arrival  time),  fair¬ 
ness  (a  metric  to  measure  if  the  resource  is  being  distributed 
equally/fairly  between  the  processes),  etc..  A  scheduler  often 
has  to  make  a  calculated  trade-off  among  these  conflicting 
metrics. 

We  consider  the  scenario  when  a  scheduler  is  serving  jobs 
from  two  users,  where  one  of  them  is  an  innocuous  user  and 
other  a  malicious  one.  The  malicious  user.  Bob,  wishes  to 
learn  the  pattern  of  jobs  sent  by  the  innocuous  user,  Alice. 
Bob  exploits  the  fact  that  when  the  processor  is  busy  serving 
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jobs  from  Alice,  his  own  jobs  experience  a  delay.  As  shown 
in  Figure  1,  Bob  computes  the  delays  experienced  by  his  jobs 
and  uses  these  delays  to  infer  about  the  times  when  Alice 
tried  to  access  the  processor,  and  possibly  the  sizes  of  jobs 
scheduled.  Learning  this  traffic  pattern  from  Alice  can  aid  Bob 
in  carrying  out  traffic  analysis  attacks.  The  scheduling  system 
thus  incidentally  creates  a  timing  based  side  channel  that  can 
be  exploited  by  a  malicious  user.  In  [2],  the  authors  consider 
the  scenario  where  a  client  is  connected  to  a  rogue  website 
using  a  TOR  network.  The  website  modulates  the  traffic  sent 
to  the  client.  The  side  channel  considered  here  exists  in  the 
intermediate  routers.  They  show  that  an  eavesdropper  can 
exploit  it  to  figure  out  the  identity  of  the  client  talking  to 
the  website,  thus  defeating  the  purpose  of  TOR.  While  that 
attack  is  no  longer  viable  [3],  the  reason  is  that  there  are 
many  more  TOR  nodes  now  than  they  were  when  [2]  was 
published,  and  not  because  the  timing  based  side  channel  has 
been  eliminated.  In  [4],  the  authors  exploit  the  side  channel  in 
a  DSL  router  to  infer  the  website  being  visited  by  the  victim. 
A  similar  side  channel  exists  within  Amazon’s  EC2  cloud 
computing  service,  which  is  exploited  in  [5],  Other  works 
on  traffic  analysis  include:  recovery  of  information  about 
keystrokes  typed  [6],  words  spoken  over  VoIP  [7],  and  utilizing 
the  timing  variations  required  for  cryptographic  operations  to 
recover  cryptographic  keys  [8].  Motivated  by  these  attacks, 
we  argue  that  while  choosing  a  scheduler,  one  has  to  consider 
the  privacy  offered  by  it  along  with  the  other  performance 
based  metrics.  This  should  especially  be  the  case  when  the 
scheduler  serves  processes  from  several  non-trusting  users, 
e.g.,  a  scheduler  used  in  a  cloud  computing  infrastructure. 

In  this  paper,  we  study  a  generic  shared  scheduler  shown 
in  Figure  1.  For  such  systems,  in  order  to  minimize  the 
information  leakage,  one  has  to  design  ‘privacy  preserving’ 
scheduling  policies.  As  a  result  of  high  correlations  between 
the  arrivals  of  one  user  and  the  waiting  times  of  the  other, 
FCFS  is  an  example  of  a  bad  policy  in  this  respect,  [1],  An 
example  of  a  good  privacy  preserving  scheduling  policy  is 
the  time  division  multiple  access  (TDMA),  where  a  user  is 
assigned  a  fixed  service  time  regardless  of  whether  he  has  any 
jobs  that  need  to  be  processed  or  not.  As  expected,  the  waiting 
times  of  jobs  issued  by  one  are  independent  of  the  others’ 
arrivals,  and  consequently,  the  policy  leaks  no  information. 
However,  TDMA  is  a  highly  inefficient  policy  in  terms  of 
throughput  and  delay,  especially  when  the  traffic  is  varying. 
It  is  especially  inefficient  when  the  number  of  users  using 
the  scheduler  is  large  [1],  FCFS  and  TDMA  represent  two 
extremes  of  the  trade-off  between  the  information  leakage  and 
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efficiency  (in  terms  of  delay  or  throughput). 

Scheduling  policies  in  which  the  server  never  idles  as  long 
as  there  is  an  unserved  job  in  the  system  are  said  to  be 
work-conserving  or  non-idling.  Examples  of  work-conserving 
policies  include  FCFS  and  round-robin  (RR).  On  the  other 
hand,  scheduling  policies  in  which  the  server  is  allowed  to  stay 
idle  even  when  there  are  unserved  jobs  in  the  system  are  said 
to  be  non-work-conserving  or  idling  policies.  Both  the  TDMA 
and  the  accumulate  and  serve  policies  (derived  in  [1]),  those 
that  offer  a  guaranteed  privacy,  are  non-work-conserving.  Is 
delay  an  inevitable  price  that  needs  to  be  paid  for  guaranteed 
privacy?  Instead,  could  the  scheduler  use  a  private  source  of 
randomness  to  obfuscate  the  attacker? 

When  all  the  jobs  are  of  the  same  size,  it  can  be  shown  that 
all  work-conserving  non-preemptive 1  policies  incur  the  same 
average  delay  [9],  Also,  policies  that  idle  incur  a  delay  which 
is  strictly  greater  than  that  incurred  by  a  work-conserving 
policy.  In  addition,  the  delay  offered  by  an  idling  policy  is 
strictly  larger  than  the  delay  offered  by  a  work-conserving 
policy.  Work-conserving  policies  therefore  represent  a  class 
of  throughput  and  delay-optimal  scheduling  policies.  In  this 
paper,  we  address  the  question:  How  does  the  most  secure 
work-conserving  scheduling  policy  stack  up  against  TDMA  on 
the  privacy  metric? 

A.  Outline  of  the  paper 

In  Section  II,  we  formally  introduce  a  system  model, 
and  the  metric  of  performance  that  we  use  to  compare  the 
privacy  of  different  scheduling  policies.  In  Section  II-A,  we 
quantify  the  highest  degree  of  privacy  that  any  scheduling 
policy  can  guarantee  (work-conserving,  or  otherwise),  and 
demonstrate  that  TDMA  provides  the  highest  privacy.  The 
privacy  performance  of  TDMA  is  used  to  benchmark  all  other 
scheduling  policies.  Next,  we  turn  our  attention  to  the  class 
of  work-conserving  policies.  Consider  a  fictitious  policy  that 
knows  the  identity  of  the  attacker,  and  gives  priority  to  jobs 
issued  by  him.  In  Theorem  3.1,  we  demonstrate  that  such  a 
policy  is  a  privacy  optimal  scheduling  policy  within  this  class. 
We  show  that  any  attack  that  can  be  carried  out  against  this 
policy  can  suitably  be  modified  and  carried  out  against  any 
other  work-conserving  scheduling  policy  as  well,  incurring 
the  same  error  for  the  attacker.  This  fact  is  used  to  bound 
the  privacy  offered  by  any  work-conserving  scheduling  policy. 
In  Section  III-A,  we  discuss  an  attack  against  this  policy. 
The  resulting  error  incurred,  denoted  by  Apriority'5’  serves 
as  an  upper  bound  on  the  privacy  performance  of  all  work- 
conserving  policies,  and  in  particular,  the  privacy  offered  by 
round-robin,  denoted  by  £pp.  In  Section  IV,  we  consider  the 
privacy  performance  of  the  round-robin  policy  and  construct  a 
lower  bound  to  it,  denoted  by  £ppOWC1.  Numerically  computing 
^Priority  ^  anc*  ^re™™  not  straightforward.  In  Section  V, 
we  relate  the  computation  of  these  errors  to  a  combinatorial 
counting  problem  which  can  then  be  solved  numerically.  It  is 
then  argued  that  a  parameter  of  the  attack,  e  can  be  chosen 

'Non-preemptive  policies  are  those  in  which  the  processing  of  a  job  is 
never  interrupted  once  it  starts  getting  served.  Throughout  this  paper,  we  will 
only  consider  non-preemptive  policies. 


Fig.  1.  An  event/packet  scheduler  being  exploited  by  a  malicious  user  to 
infer  the  arrival  pattern  from  the  other. 

suitably  so  that  the  upper  bound  on  matches  the  lower 
bound  exactly,  thus  proving  the  optimality  of  the  round-robin 
policy  on  the  privacy  metric.  However,  as  shown  in  Figure 
5,  there  is  a  large  gap  between  the  privacy  offered  by  round- 
robin  and  that  by  TDMA.  Therefore,  if  the  delay  offered  by  a 
scheduling  policy  is  of  a  higher  importance  than  the  privacy 
offered  by  it,  i.e.,  if  one  is  looking  for  a  policy  within  the 
class  of  work-conserving  policies,  then  the  round-robin  policy 
is  a  good  candidate.  Otherwise,  if  the  privacy  offered  by  it  is 
of  a  higher  importance,  one  has  to  pay  the  price  of  increased 
delay.  Finally,  in  Section  V-A,  we  discuss  the  implications  of 
our  work. 

This  work  builds  on  our  earlier  works  [1],  [9].  In  [1],  we 
develop  a  formal  framework  to  study  the  information  leakage 
in  shared  schedulers.  In  that  work,  it  was  shown  that  the 
FCFS  (a  work-conserving  policy)  scheduling  policy  does  leak 
significant  information,  while  TDMA  (an  idling  policy)  leaks 
the  least.  We  also  proposed  and  analyzed  a  provably  secure 
scheduling  policy,  called  accumulate-and-serve,  another  idling 
policy,  which  traded  off  delay  for  improved  privacy. 

In  this  work,  we  ask  the  question,  is  there  a  work-conserving 
scheduling  policy  that  fares  high  on  the  privacy  metric?  This 
is  the  same  question  we  ask  in  [9],  The  major  difference 
between  that  work  and  the  current  one  (  and  also  between 
[1]  and  the  current  one)  is  the  metric  used  to  quantify  the 
privacy  offered  by  a  scheduling  policy,  which  is  discussed  in 
the  following  section.  The  results  derived  in  Section  III  of  this 
paper  are  similar  to  those  derived  in  [9],  However,  the  proofs 
are  different  owing  to  the  difference  in  the  definition  of  privacy 
metric.  Also,  the  results  provided  in  this  paper  subsume  those 
presented  in  [9],  and  are  much  stronger. 

II.  System  Model  and  Definitions 

The  system  model  is  a  scheduler  shared  by  two  users, 
the  innocuous  user,  Alice,  and  the  malicious  one.  Bob.  The 
scheduler  can  serve  jobs  at  a  rate  of  one  per  unit  time. 
Alice  issues  unit  sized  jobs  to  the  scheduler  according  to  a 
Poisson  process  of  rate  A2.  The  total  number  of  jobs  issued 
by  Alice  until  time  u  is  given  by  _4a(w)-  The  malicious 
user.  Bob,  also  referred  to  as  the  attacker,  issues  his  jobs  at 
times  i™  =  {fi,f2>  •  •  • ,  tn},  and  is  free  to  choose  their  sizes. 
Si  =  {si,s2,---,s„},  as  well.  Fet  t'f  =  {t[,  t2l . . . ,  tn} 
be  the  departure  times  of  these  jobs.  Bob  makes  use  of  the 
observations  available  to  him,  the  set  and  the 

knowledge  of  the  scheduling  policy  used,  in  estimating  Alice’s 
arrival  pattern.  The  arrival  pattern  of  Alice  is  the  sequence 
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{Xfe} k~i,2,...,N,  where  Xk  =  AA{kc)  -  AA((k  -  l)c),  is  the 
number  of  jobs  issued  by  Alice  in  the  interval  ((fc  —  l)c,  kc], 
referred  to  as  the  kth  clock  period  of  duration  c.  Nc  is  the 
time  horizon  over  which  the  attacker  is  interested  in  learning 
Alice’s  arrival  pattern. 

The  privacy  offered  by  a  scheduling  policy  is  measured 
by  the  long  run  estimation  error  incurred  by  Bob  in  such  a 
scenario  when  he  is  free  to  decide  the  number  of  jobs  he 
issues,  times  when  he  issues  them  and  their  sizes,  subject  to  a 
rate  constraint,  and  when  he  optimally  estimates  Alice’s  arrival 
pattern.  Formally,  the  privacy  offered  by  a  scheduling  policy 
is  defined  to  be: 
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where,  the  expectation  is  taken  over  the  joint  distribution  of 
the  arrival  times  of  Alice’s  jobs,  the  arrival  times  and  sizes 
of  jobs  from  the  attacker  and  his  departure  times.  This  joint 
distribution  is  in  turn  dependent  on  the  scheduling  policy  used, 
which  is  known  to  the  attacker.  Finally,  A  is  the  rate  restriction 
on  arrivals  from  Bob.  Furthermore,  the  attacker  is  assumed  to 
know  the  statistical  description  of  Alice’s  arrival  process.  A 
scheduling  policy  is  said  to  preserve  the  privacy  of  its  users 
if  the  resulting  estimation  error  for  the  attacker  is  high. 

In  this  work  we  only  consider  the  case  when  there  are  two 
users  of  the  system,  the  innocuous  user  and  the  attacker.  From 
a  privacy  perspective,  the  two  user  scenario  is  the  worst  case. 
It  is  true  that  if  there  are  more  users  of  the  system,  the  attacker 
can  only  learn  about  the  cumulative  arrival  pattern  from  all  the 
users.  However,  as  the  authors  in  [5]  state,  in  such  systems, 
the  attacker  typically  waits  for  a  time  when  he  can  be  assured 
that  the  victim  is  the  only  other  user  of  the  scheduling  system 
and  launches  an  attack  then.  A  policy  that  fares  well  on  the 
privacy  metric  in  the  two  user  scenario  is  also  guaranteed  to 
perform  well  in  the  multiple  user  scenario. 

Note  that  we  allow  the  attacker  to  choose  the  size  of  the 
jobs  that  he  issues.  This  is  in  contrast  to  our  previous  works, 
[1]  and  [9],  where  he  could  only  issue  jobs  of  size  one.  The 
results  derived  in  this  work  are  therefore  stronger  in  the  sense 
that  a  policy  that  is  secure  on  this  metric  is  also  provably 
secure  on  the  earlier  metric. 


A.  The  maximum  estimation  error  that  the  attacker  can  incur 

With  the  metric  of  privacy  as  defined  in  (1),  it  is  easy  to 
quantify  the  maximum  estimation  error  any  rational  attacker 
would  incur,  as  show  in  Theorem  4.1  in  [1],  By  ignoring 
all  the  observations  available  to  him,  viz.,  {t”,  s", tf1},  and 
estimating  Xk  using  its  statistical  mean  alone,  A2C,  the  attacker 
incurs  an  error  equal  to  the  variance  of  Xk,  which  is  also  equal 
to  ££[ax  =  A2C.  Hence,  £^ax  serves  as  a  benchmark  against 
which  other  scheduling  policies  can  be  compared. 

Also,  as  shown  in  Section  IV  in  [1],  the  time-division- 
multiple-access  (TDMA)  scheduling  policy  achieves  this 
bound.  This  is  because,  when  TDMA  scheduling  policy  is 


used,  the  departures  of  one  user  are  completely  independent 
of  the  arrivals  from  the  other.  Therefore,  TDMA  is  a  privacy 
optimal  scheduling  policy.  However,  as  discussed  in  Section 
III  of  [9],  because  TDMA  is  non-work-conserving,  it  loses  out 
on  performance  based  metrics  such  as  throughput  region  and 
delay.  In  the  two  user  scenario,  the  rate  at  which  each  user 
issues  jobs  needs  to  be  less  than  0.5  in  order  for  the  system 
to  be  stable.  A  system  is  said  to  be  stable  if  the  number  of 
unserved  jobs  in  the  system  does  not  blow  to  infinity.  However, 
if  the  scheduler  instead  used  a  work-conserving  policy,  the 
system  would  be  stable  as  long  as  the  sum  of  the  rates  at 
which  the  two  users  issued  their  jobs  is  less  than  1.  Also, 
unless  the  arrivals  are  periodic,  TDMA  incurs  large  delays. 

In  the  subsequent  section,  we  identify  the  most  secure 
work-conserving  scheduling  policy  and  characterize  its  privacy 
metric. 

III.  The  most  secure  work-conserving  policy 

In  this  section,  we  derive  a  bound  on  the  privacy  perfor¬ 
mance  of  any  work-conserving  policy.  We  do  so  by  showing 
that  if  the  scheduler  were  allowed  to  pick  any  work-conserving 
policy  that  served  jobs  from  both  Alice  and  the  attacker,  the 
best  strategy  for  it  is  to  pick  the  policy  that  gives  priority  to 
jobs  from  the  attacker.  Therefore,  analyzing  the  performance 
of  the  priority  policy  serves  as  a  bound  on  the  performance  of 
any  other  work-conserving  scheduling  policy.  Although  this 
policy  is  not  implementable,  because  the  scheduler  would 
not  know  the  identity  of  the  attacker,  analyzing  the  privacy 
performance  of  this  fictitious  policy  gives  a  bound  on  the 
privacy  performance  of  all  work-conserving  policies. 

Theorem  3.1:  A  scheduling  policy  that  gives  priority  to  jobs 
from  the  attacker  is  a  privacy  optimal  scheduling  policy  within 
the  class  of  non-idling  policies.  That  is,  if  WC  is  the  class  of 
all  work-conserving  policies,  and  £priority  is  the  privacy  metric 
of  the  policy  that  gives  priority  to  jobs  from  attacker, 

£p- X  <  Priority.  VP  €  WC.  (2) 

Proof:  A  proof  is  given  in  Appendix  A  ■ 

In  the  following  section,  we  specify  one  attack  against  the 
priority  policy  and  compute  the  resulting  error. 

A.  An  attack  against  the  priority  policy 

Without  loss  of  generality,  we  will  assume  that  at  time  0 
the  system  is  completely  empty,  i.e.,  there  are  no  outstanding 
jobs  from  any  of  the  users.  At  time  0,  the  attacker  issues  a 
job.  From  then  on,  he  injects  a  new  job  one  time  unit  after  the 
completion  of  his  previous  job.  Formally,  let  ti,  t2,  ■  ■  ■  be  the 
times  when  attacker  injects  jobs  into  the  system,  and  t1,  t2,  ■  ■  ■ 
be  their  departure  times.  Then,  t\  =  0,  and, 

tk+i  =  tk  + 1,  k  =  1,2,...  (3) 

The  size  of  all  jobs  is  e,  a  parameter  which  will  be  specified 
later.  At  the  rate  the  attacker  issues  his  jobs,  the  system  can 
be  shown  to  be  stable  when  Alice’s  arrival  rate  is  less  than 
1  —  e.  Therefore,  e  has  to  be  chosen  to  be  less  than  1  A2  so 
that  the  system  is  kept  stable. 
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Analysis  of  the  estimation  error  incurred:  Note  that,  for 
some  job  k ,  if  tk  —  tk  =  e,  i.e.,  if  the  kth  job  goes  into  service 
immediately  after  it  is  issued,  then  it  must  be  the  case  that 
the  system  was  empty  when  the  job  was  issued,  at  time  tk- 
A  busy  period  of  the  scheduling  system  is  an  interval  when 
the  processor  is  busy  serving  jobs  of  wither  of  the  users.  The 
following  lemma  states  that,  through  this  attack.  Bob  learns 
the  start  and  end  times  of  all  the  busy  periods.  Define  rf  = 
{r i,  r2, . . . ,  to  be  the  start  times  of  the  busy  periods  until 
time  Nc,  and  let  r-f  =  {r1,r2,  -  ■  ■ ,  rL}  be  the  end  times  of 
these  periods. 

Lemma  3.2:  The  start  and  end  times  of  the  busy  periods 
can  be  computed  by  the  attacker.  Formally,  rJ{  and  are  a 
deterministic  function  of  the  arrival  and  departure  times,  t\\ 

and  t'f. 

Lemma  3.3:  Given  the  end  times  of  the  busy  periods,  the 
arrival  and  departure  times  of  the  attacker’s  jobs  can  be 
computed.  Formally,  £”  and  t-f  are  a  deterministic  function 
of  r'L . 

Proof:  The  proofs  of  these  two  lemmas  are  given  in 
appendices  B  and  C  respectively.  ■ 

As  a  consequence  of  Lemmas  3.2  and  3.3,  we  have 
E[-Xfc|t",  tf]  =  E[Xfc|r1i:'].  Therefore,  the  estimation  error 
incurred  by  the  attacker  is  the  estimation  error  incurred  in 
estimating  the  arrival  pattern  knowing  the  start  and  end  times 
of  the  busy  periods.  In  addition  to  the  jobs  issued  by  Alice, 
the  jobs  issued  by  the  attacker  also  determine  the  duration  of 
the  busy  periods.  It  can  be  shown  that  the  attacker  is  better 
off  estimating  the  arrival  pattern  of  Alice  if  the  busy  periods 
are  of  short  duration.  When  the  busy  periods  are  long,  there 
are  many  possible  arrival  patterns  that  lead  to  a  busy  period 
of  the  same  duration,  leading  to  higher  estimation  error  for 
the  attacker.  Notice  that  the  results  of  Lemmas  3.2  and  3.3 


hold  true  for  all  values  of  e,  the  size  of  jobs  issued  by  the 
attacker.  Consequently,  as  shown  in  Section  V,  in  order  to 
estimate  Alice’s  arrival  pattern  with  highest  precision,  the 
attacker  should  choose  the  value  of  e,  the  size  of  his  jobs, 
to  be  as  small  as  possible. 

Denote  by  fp^ority  C  the  resulting  estimation  error  incurred 
by  the  attacker.  We  will  defer  the  computation  of  the  best  esti¬ 
mate,  E[Xfc|r1i],  and  the  resulting  estimation  error,  £p^^y’e, 
to  Section  V.  Since  £p’riority  is  the  smallest  error  the  attacker 
can  incur  among  all  the  attacks  that  he  can  possibly  launch. 


and  Apriority  6  *s  the  error  incurred  by  launching  one  specific 
attack,  we  have  >  ferity’  VA  <  1  -  e.  Also,  as 

a  consequence  of  (2),  fp’rioHty’'  —  ^pA>  ^  e  WC,VA  < 
1  —  e.  In  particular,  £p’riority  e  bounds  the  privacy  performance 
of  the  round-robin  policy.  In  the  following  section,  we  will 


provide  a  lower  bound  on  the  estimation  error  incurred  by  any 


attacker  against  the  round-robin  scheduling  policy,  denoted  by 
£rr  WC‘  ■  Therefore,  the  following  set  of  inequalities  follows: 


nc, upper, e  ^  pc,A  r»c,A  \  cc>l°wer  V/\  1 

^Priority  —  ^Priority  —  ^RR  —  ^RR  ’  VA  —  1 


(4) 


In  Section  V,  it  will  be  shown  that  lim  Priority  6  =  ^rr  WCT’ 

thus  proving  that  the  bound  computed  on  Priority  ht  ’■his 
section  is  tight.  And  more  importantly,  that  round-robin  is  a 


privacy  optimal  scheduling  policy  within  the  class  of  work- 
conserving  policies. 

IV.  Privacy  performance  of  the  round-robin 
policy 

The  round-robin  scheduling  policy  serves  jobs  from  multiple 
users  as  follows.  Suppose  there  are  m  users  issuing  jobs  to 
the  scheduler,  indexed  1  through  m.  After  completion  of  a 
job  issued  by  user  i,  the  scheduler  works  on  a  job  from  user 
i  +  1,  if  present.  If  there  are  no  jobs  from  user  i  +  1,  the 
scheduler  works  on  a  job  from  user  i  +  2,  and  so  on.  This 
is  known  to  be  a  ‘fair’  policy  [10],  and  because  it  is  non¬ 
idling,  it  is  also  throughput  optimal.  In  this  section,  we  will 
show  that  it  is  also  optimal  on  the  privacy  metric  within  the 
class  of  work-conserving  policies.  We  start  by  constructing 
a  lower  bound  to  £RR,  the  estimation  error  incurred  by  the 
strongest  attacker  against  the  round-robin  policy.  We  do  so  by 
providing  the  attacker  with  a  side  information.  Without  this 
extra  information,  the  attacker  can  only  be  worse  off  in  his 
estimation. 

A.  A  lower  bound  on  the  £R4 

Consider  a  round-robin  scheduler  where  Alice  is  the  only 
user  of  the  system.  The  times  when  she  issues  her  jobs  is  given 
by  the  cumulative  arrival  process  Atfu),  where  u  indexes 
time.  Let  T>\(u)  denote  the  total  amount  of  service  received 
by  Alice  until  time  u  in  this  system.  Let  A\  and  T>\  represent 
the  functions  AIa(m),Vu  and  V\(u),\/u  respectively.  A\  is  a 
counting  function,  and  T)\  is  a  non-decreasing  function  with 
a  slope  either  0  or  1  (the  function  is  differentiable  almost 
everywhere).  If  the  slope  is  1  at  a  time  u,  the  processor  is  busy 
serving  a  job  then.  If  the  slope  is  0,  then  the  processor  has 
finished  serving  all  the  jobs  issued  by  Alice  till  then  (scheduler 
never  idles).  Note  that  T>\  <  A\  and  consequently,  V\(u)  is  a 
lower  bound  on  the  total  number  of  jobs  that  have  arrived  from 
Alice  till  time  u.  Now,  consider  a  round-robin  scheduler  that 
is  used  both  by  Alice  and  Bob,  as  shown  in  Figure  2.  Suppose 
Alice’s  arrivals  are  the  same  as  in  the  earlier  system.  Let  tf  be 
the  times  when  Bob  issues  his  jobs,  s"  be  their  sizes,  and  t-f 
be  their  departure  times.  Denote  by  V\(u)  the  total  service 
received  by  Alice  until  time  u  for  this  system.  Having  chosen 
his  arrival  times  and  sizes,  and  having  observed  their  departure 
times.  Bob  has  to  estimate  the  total  number  of  arrivals  from 
Alice  in  a  clock  period  k.  We  will  consider  the  scenario  where 
the  attacker  is  given  T>\  as  a  side  information  as  shown  in 
Figure  2.  We  will  show  that  the  resulting  estimation  error  for 
this  attacker  is  a  lower  bound  on  £RR. 

Theorem  4.1:  The  estimate  E[Xfc|f", s”, tf1]  is  an  inferior 
estimate  compared  to  E^YfclD^]-  Therefore,  if  the  attacker  is 
given  a  side  information,  the  function  T>\,  his  own  arrival  and 
departure  times  give  him  no  further  information  about  Alice’s 
arrival  pattern,  and  consequently  can  be  discarded. 

We  first  prove  the  following  two  lemmas  which  form  the 
basis  of  the  proof  of  the  theorem  stated  above. 

Lemma  4.2:  The  departure  times  of  jobs  issued  by  the 
attacker,  t-f,  are  a  function  of  their  arrival  times,  their 
sizes  s™,  and  T>\. 
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Fig.  2.  Pictorial  representation  of  the  computation  of  a  lower  bound 

on  Apart  from  the  information  available  to  him  through  his  attack,  the 

attacker  is  also  given  a  side  information  (shown  in  dotted  arrow)  which  are 
the  departures  if  Alice  was  the  only  user  of  the  scheduling  system. 


Lemma  4.3:  The  arrival  times  t%  are  a  function  of  / 1 ,  si, 
and  T>\.  Also,  t\  and  si  are  independent  of  T>\  and  X k . 

Proof:  The  proofs  of  these  lemmas  are  presented  in 
Appendix  D-A  and  D-B,  respectively.  ■ 

Proof  of  Theorem  4.1:  To  prove  the  theorem,  first  note 
that  the  estimate  E[Xfc|i”,  s™,  t™, Tff\  is  a  superior  estimate 
of  Xk  compared  to  E[X/-|f",  s™,  tf1]  (more  information  the 
attacker  has,  more  accurate  his  estimation  of  Alice’s  traf¬ 
fic  pattern  will  be).  As  a  result  of  Lemmas  4.2  and  4.3, 

nxk\vt,s^,t^,v\]  =  E[Xfe|f1)Sl,I?i]  =  nxk\v\}. 


As  a  consequence  of  Theorem  4.1,  we  have, 


V{M},  E 


(xk-E[Xk\ 

(Xk  -  V[Xk\V\]f 


>  E 


and  therefore,  > 


HR  ^  -RR  .  Where- 

1  JY 


£RRwer  =  {Xk-E[Xk\Vi)Y 


fc= 1 


(5) 


In  the  following  section,  we  present  a  technique  to  compute 

c>c,  lower  •  n 

cRR  numerically. 


V.  Computation  of  the  best  estimate  and  the 

RESULTING  ESTIMATION  ERROR 

In  this  section,  we  present  an  algorithm  to  compute 
E[X*.|  T>\],  and  fpR°WCI  numerically.  We  do  so  by  specifying 
an  equivalence  between  the  estimation  problem,  and  an  equiva¬ 
lent  combinatorial  path  counting  problem.  The  same  technique 
can  also  be  used  to  compute 

Recall  that  £’|‘;|l)nvc'  is  the  estimation  error  incurred  by  an 
attacker  who  is  given  the  departure  process  of  a  scheduler 
used  only  by  Alice.  Note  that  this  additional  side  information 
is  equivalent  to  providing  the  attacker  with  the  start  and 
end  times  of  the  busy  periods  of  a  scheduler  used  only  by 
Alice.  We  start  by  considering  a  slightly  different  problem  of 
constructing  the  best  estimate  of  the  number  of  arrivals  within 
a  busy  period.  Formally,  suppose  a  busy  period  of  duration 
B  +  1  is  initiated  at  time  0,  where  B  is  some  non-negative 
integer.  Suppose  an  attacker  observes  that  the  processor  is 
busy  from  time  0  to  time  B  +  1,  and  he  knows  that  there 
is  only  one  user  issuing  jobs  to  the  system.  He  wishes  to 
estimate  the  number  of  arrivals  between  times  (111,112)  where. 


Fig.  3.  Two  counting  functions  that  lead  to  a  busy  period  of  duration  5. 


0  <  ui  <  U2  <  B  +  1.  The  job  that  initiates  the  busy 
period  arrives  at  time  u o  =  0.  Let  u\,  U2,  ■  ■  ■ ,  uB,  ub+i  be 
the  arrival  times  of  the  next  B  +  1  jobs.  Then,  in  order  to 
sustain  the  busy  period  of  duration  B  +  1,  the  arrival  times 
of  these  jobs  should  satisfy  u\  <  1, w2  <2 , ...  ,uB  <  B  and 
ub+i  >  B+l.  This  is  because,  the  job  from  Alice  that  arrived 
at  time  zero  goes  into  service  immediately  and  departs  at  time 
one.  Therefore,  in  order  for  the  busy  period  to  sustain  beyond 
time  one,  there  must  be  atleast  one  more  arrival  by  then.  By 
a  similar  argument,  the  second  job  has  to  arrive  before  time 
two,  and  so  on.  Finally,  for  the  busy  period  to  end,  we  need 
ub- i-i  >5  +  1.  Figure  3  shows  two  possible  arrival  patterns 
(shown  as  counting  functions)  that  could  lead  to  a  busy  period 
of  duration  5.  In  fact,  any  counting  function  that  lies  in  the 
green  region  (lightly  shaded  region)  leads  to  a  busy  period  of 
duration  5.  In  the  figure,  the  arrival  at  time  0  is  not  shown. 

Let  {iVsj-^o  be  a  Poisson  process  of  rate  A2.  For  a  positive 
integer  t,  and  non-negative  integers  i,j,j  >  i,  define 

5ij(t)  =Pr (Nt  =j,Ns  >s,s  =  1,2, . . .  ,t  -  1|JV0  =  i). 

Si  j(t)  is  the  probability  that  a  Poisson  counting  function  of 
rate  A2  jumps  to  state  j  by  time  t  given  that  it  starts  at  state 
i  at  time  0  while  staying  above  the  boundary  Ns  =  s,s  = 
1,  2, . . . ,  f  —  1.  It  can  be  shown  that  the  expected  number  of 

arrivals  by  time  t  within  a  busy  period  of  duration  B  +  1  is 

B-\-l  _  _ 

given  by  B-t(B  —  t).  Computation  of  6ij(t) 

i=t ' 

is  necessary  in  order  to  compute  the  best  estimates  and  the 
resulting  error.  However,  deriving  a  closed  form  expression 
for  Si  j(t)  is  not  easy  other  than  some  special  cases.  We 
transform  the  problem  of  computing  5ij(t)  to  a  combinatoric 
path  counting  problem  which  admits  a  numerical  solution. 
Before  doing  so,  we  state  the  following  theorem  which  gives 
an  equivalence  between  a  Poisson  counting  process,  and  a 
Geometric  approximation  to  it  which  is  used  later. 

Lemma  5.1:  Let  {54”}i= 0,1,2,...  be  a  sequence  of  i.i.d.  Geo¬ 
metric  random  variables  indexed  by  integer  n,  with  Pr  (Yf  = 

Ls"J 

k)  =  pk(  1  -  p),  k  =  0, 1,  2, . . ..  Define  Nf  =  £  Yf.  Let 

i— 0 

p  scale  with  n  such  that  np  =  A,  where  A  is  a  constant.  As 
defined  earlier,  let  Ns  be  a  Poisson  process  of  rate  A2.  Then, 
for  any  finite  integer  k,  and  any  0  <  G  <  f2  <  . . .  <  tk,  the 
joint  distribution  of  7Vt" ,  iVt" , . . . ,  Nf  converges  to  the  joint 
distribution  of  Ntl ,  Nt2 , . . . ,  Ntk  as  n  — >  00. 

Proof:  Refer  Section  2.2.5  and  in  particular.  Theorem 
2.2.4  and  Corollary  2.2.1  of  [11],  The  reference  gives  a 
Bernoulli  approximation  of  a  Poisson  process.  The  proof  for 
the  Geometric  random  variable  is  very  similar.  ■ 


6 


Lemma  5.1  states  the  following.  Suppose  a  particle  moves 
on  a  lattice,  moving  right  a  distance  of  -  and  moving  up  a 
distance  of  k  with  probability  (^)fe(l  —  ^),  k  =  0, 1,  2, . . .. 
Then,  the  path  of  such  a  particle  follows  a  Poisson  process  of 
rate  A2  in  the  limit  n  — >  00.  In  Figure  4,  in  green,  we  plot  the 
sample  path  of  a  particle  that  moves  in  this  fashion. 

The  equivalence  between  computation  of  8ij(t)  and  a 
combinatoric  counting  problem  is  given  in  the  following: 


=  Pr(Nt  =j\N0  =  i ) 

Pr (Nt  =  j,Na  >  s,s  =  l,...,t  -  l|iVo  =  i) 


Pr(Nt  =  j\N0  =  i) 


=  e 


-A2t(A  2t)j~i 


x  lim 


O'  -  *)! 

Pr  (N?  =  j,N?>s,s=l,...,t-  1|%  =  i) 


Pr  (N?  =  j\Nft  =  i) 


(6) 


0-A2t(A 


x  lim 


#Paths(0,i)  — >  ( nt,j )  avoiding  the  boundary 


#Paths(0,i)  — >  ( nt,j ) 


(7) 


(6)  follows  from  the  results  of  Lemma  5.1.  The  denominator 
of  the  fraction  on  the  right  side  of  the  equality  in  (6)  is  the 
probability  that  a  particle  starting  at  the  point  (0,  i)  and  moving 
according  to  the  Geometric  process  described  before  hits  the 
point  (tn,j).  The  numerator  is  the  probability  that  a  particle 
starting  at  (0,  z)  hits  the  point  ( tn,j )  while  staying  above  the 
boundary  {(a,  b)  :  b  =  [-]  —  1}.  Note  that  for  this  geometric 
process,  there  are  a  finite  number  of  paths  between  the  two 
points.  In  all  these  paths  that  originate  at  (0,  i)  and  terminate 
at  (tn,j),  the  particle  ‘moves  right  and  up’  j  —  i  times  and 
‘moves  right  without  jumping’  nt  —  ( j  —  i)  times.  Therefore, 
the  probability  of  the  particle  taking  any  of  these  paths  is 
the  same,  equal  to  (^-)7_l(l  —  Ff)nt .  Therefore  the  ratio  of 
the  two  probabilities  is  just  equal  to  the  ratio  of  the  number 
of  lattice  paths  on  a  grid  that  avoid  a  boundary  to  the  total 
number  of  lattice  paths  between  the  aforementioned  points. 

(7)  therefore  follows,  where  the  boundary  is  the  set  of  integer 
points  (a,  b)  which  satisfy  {b  =  [— ]  —  1}.  In  Figure  4,  the  red 
line  corresponds  to  the  boundary. 

The  significance  of  Lemma  5.1  is  that,  for  any  finite  n,  the 
number  of  lattice  paths  can  be  counted.  It  is  easy  to  see  that 
the  denominator  in  (7)  is  given  by  (nt+Jr®)  =  + 

where  lim  o(7p_l) /n7_*  =  0.  There  is  no  closed 

n — >00 

form  expression  for  the  numerator  in  (7)  though.  Counting  the 
number  of  lattice  paths  between  two  points  on  a  grid  while 
avoiding  a  boundary  is  an  extensively  studied  combinatorial 
problem  with  several  applications,  refer  [12],  Using  the  lemma 
stated  below,  we  can  show  that  the  numerator  in  (7)  is  given 
by  7<z,  j,t)n^~l  +  o(nP~l),  and  the  value  of  7 can  be 
computed  exactly. 

Lemma  5.2  (Lemma  3 A  of  Chapter  1  in  [12]):  The  num¬ 
ber  of  paths  dominated  by  the  path  p  with  vector 
(01,  (I2 , . . . ,  an)  can  be  recursively  calculated  as  Vn  using  the 


Fig.  4.  Sample  path  of  a  particle  (the  path  in  green)  that  moves  on  the 
grid  according  to  the  Geometric  process.  From  Lemma  5.1,  b(B  ~  t)  can 
be  computed  by  counting  the  number  of  total  number  of  paths  from  (tn,i)  to 
(Bn,B),  and  those  that  lie  above  the  boundary  (denoted  in  red),  and  taking 
the  appropriate  ratio,  in  the  limit  n  — >  oo. 


recursion  formula 

k 

Vk = (afe”T + x)  Vk-j ’  y° = L  (8) 

The  definition  of  the  vector  representation  of  a  path  and  of 
path  domination  is  given  in  Sections  1  and  3,  respectively,  of 
Chapter  1  of  [12], 

The  number  of  lattice  paths  from  (0,i)  to  ( tn,j )  staying 
above  the  specified  boundary  is  equal  to  the  number  of  lattice 
paths  from  the  origin  to  ( tn ,  j  —  i)  dominated  by  a  path  with 
vector  (ai  =  n(i  +  1),  a2  =  n{i  +  2), . . . ,  at-i- 1  =  n(t  — 
1  ),at—i  =  nt,a,t-i+i  =  nt, . . .  ,dj-i  =  nt).  Using  this  fact, 
the  ratio  in  (7),  7 (i,j,t)(j  —  i)\/P~l,  and  consequently  8ij(t) 
can  be  evaluated  exactly ,  for  any  integers  i,  j,  t.  A  suitably 
modified  definition  of  Sij(t)  for  non-integer  values  of  t  can 
be  expressed  in  terms  of  <5*  j([t_|)  and  and  can  be 

computed. 

Now,  going  back  to  the  evaluation  of  f^wer,  let  r[J  and 
r/'  denote  the  start  and  end  times  of  busy  periods  of  a 
system  in  which  Alice  is  the  only  user.  Note  that  providing 
the  attacker  with  the  side  information  T)\  is  equivalent  to 
providing  him  with  the  start  and  end  times  of  the  busy  periods. 
Now  define  Fk  =  r k  —  rk_1,  k  =  2,  3, . . .,  with  F\  =  77 .  The 
random-variables  {Fk}k>2  are  independent  and  identically 
distributed  random  variables  (owing  to  the  fact  that  arrivals 
from  Alice  follow  a  Poisson  process  which  is  memoryless), 
and  consequently,  the  end  times  of  the  busy  periods  form  a 
renewal  process.  Let  Pk  denote  the  set  {Fi,  F;+i, . . . ,  Fm} 
where  l  =  arg max, {r,j  <  (k  —  l)c}  and  m  =  argmin j{fj  > 
kc}.  Pk  is  the  set  of  the  busy  periods  that  ‘cover’  clock 
period  k.  It  can  be  shown  that  the  pair  ( Xk,Pk )  form  a 
Markov  chain,  and  also  E[Xfc|T>^]  =  E[Xfc|Pfc],  Blackwell’s 
celebrated  renewal  theorem  can  be  used  to  compute  the  joint 
distribution  of  (A f,Pk),  expressing  it  in  terms  of  Sij(t).  The 
resulting  estimation  error  can  then  be  computed  numerically. 
The  same  procedure  can  be  used  to  compute  Apriority’6  as  weH- 
Furthermore,  it  can  be  shown  that  in  the  limit  e  — >  0,  the  error 
incurred  by  the  attacker  is  same  as  £pp°wcl  ■  This  result  is  true 
because,  as  described  in  Section  III-A,  through  the  attack  Bob 
learns  the  start  and  end  times  of  the  busy  periods  of  the  system, 
for  all  values  of  e.  When  the  size  of  jobs  issued  by  Bob  is 
small,  the  busy  periods  of  this  system  are  statistically  identical 
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Fig.  5.  Plot  of  ^RWer/^Max  f°r  two  cases  when  the  clock  period 

is  c  =  2  and  c  =  5.  These  are  also  the  curves  for  lirn 

A  curve  for  ^p^/^ax  hes  below  ^’pJ>wer/^Max  f°r  anY  work-conserving 
policy  P. 


to  the  busy  periods  of  a  system  where  Alice  is  the  only  user. 
The  details  are  skipped  for  the  lack  of  space.  In  Figure  5,  we 
plot  the  computed  values  of  £RRWer,  which  is  also  equal  to 

v  <7C,  upper,  e 
Priority’  • 


A.  Discussion 

Recall  that  £priority  is  a  bound  on  the  performance  of  all 
work-conserving  policies,  and  in  particular,  the  privacy  offered 
by  round-robin,  which  is  e  is  an  upper  bound 

to  £priority  and  £pp°W01  is  a  lower  bound  to  £pp.  These 
errors  are  normalized  by  the  maximum  privacy  that 

any  policy  can  offer.  A  normalized  error  close  to  zero  means 
that  the  policy  offers  very  little  privacy,  and  if  it  is  close  to 
one,  the  attacker  learns  no  information  about  Alice’s  arrival 
pattern.  In  Figure  5,  we  plot  £ppWer/£Max  as  a  function  of 
A2,  the  arrival  rate  of  jobs  from  Alice.  Note  that  this  is  also 
equal  to  lim  £pr“^®y  e/£^jax>  and  as  a  consequence  of  (4), 

the  curves  also  represent  £prtority  Aviax  and  most  importantly, 
£^R/£ylax.  In  the  plot,  we  consider  two  scenarios,  one  where 
the  clock  period  is  set  to  2,  and  the  other  where  it  is  set  to  5.  As 
expected,  the  attacker  incurs  a  higher  normalized  error  when 
he  wishes  to  estimate  Alice’s  arrivals  with  greater  precision. 
The  curves  represent  the  maximum  estimation  error  that  the 
best  attacker  will  incur  against  any  work-conserving  policy. 
Note  that  there  is  a  relatively  large  gap  between  the  privacy 
performance  of  work-conserving  and  policies  that  are  allowed 
to  idle.  For  instance,  when  Alice’s  arrival  rate  is  less  than 
0.4,  any  work-conserving  policy  can  guarantee  a  privacy  no 
greater  than  just  10%  of  the  privacy  that  can  be  guaranteed  by 
TDMA.  In  [13],  the  authors  state  that  in  most  cloud  computing 
platforms,  the  load  is  typically  less  than  0.2.  In  such  scenarios, 
the  designers  of  the  system  need  to  be  aware  of  the  existence 
and  possible  exploitation  of  the  timing  based  side  channel 
discussed  in  this  work.  In  the  ‘high-traffic  regime’,  the  privacy 
offered  by  the  round-robin  policy  is  comparable  to  that  by 
TDMA.  The  reason  behind  this  is  the  following.  As  stated  in 
Theorem  4.1,  the  maximum  information  that  the  attacker  can 


learn  by  performing  any  attack  against  the  round-robin  policy 
is  the  start  and  end  times  of  the  busy  periods  of  the  scheduling 
system.  When  Alice’s  rate  is  high,  most  of  the  busy  periods 
are  of  extremely  long  duration.  When  busy  periods  are  long, 
there  are  several  possible  arrival  patterns  of  Alice  that  could 
lead  to  the  same  busy  period.  Therefore,  the  attacker  learns 
very  little  by  performing  the  attack,  and  therefore  incurs  a 
large  error.  While  the  curves  can  be  computed  for  all  values  of 
A2  <  1,  doing  so  for  higher  values  of  A2  requires  computation 
of  factorials  of  large  numbers.  Owing  to  the  possible  numerical 
errors  involved  in  these  computations,  we  skip  plotting  the 
curve  in  this  regime. 


VI.  Conclusion 

In  this  work,  we  quantify  the  privacy  offered  by  work- 
conserving  scheduling  policies  by  showing  that  the  round- 
robin  is  a  privacy  optimal  policy  in  this  class,  and  quantifying 
its  privacy  metric.  We  show  that  all  work-conserving  policies 
fare  very  poorly  on  the  privacy  metric.  This  is  especially  true 
in  the  low-traffic  regime.  This  is  because,  when  the  arrival 
rate  from  the  user  is  low,  there  is  typically  only  one  or  no 
jobs  from  her  in  the  buffer  at  any  given  time.  Therefore,  if 
the  scheduler  is  forced  to  serve  jobs  present  in  the  buffer 
without  idling,  the  scheduler  does  not  have  many  options  to 
choose  from,  and  therefore  through  the  process  of  scheduling, 
leaks  some  information  about  Alice’s  jobs  to  the  attacker.  This 
observation  is  consistent  with  our  earlier  results  in  [14]  and 
[15],  where  we  used  a  correlation  based  metric  to  quantify  the 
information  leakage.  We  had  observed  that  although  round- 
robin  did  leak  less  information  to  the  attacker  compared  to 
FCFS,  the  two  performed  similarly  in  the  low  traffic  regime, 
and  both  were  equally  vulnerable.  A  surprising  result  from  our 
analysis  is  that,  even  in  the  medium-traffic  regime,  there  exists 
no  randomized  policy  which  can  guarantee  a  high  estimation 
error  for  the  attacker. 
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Appendix  A 
Proof  of  Theorem  3.1 

The  following  lemma  is  used  to  prove  the  theorem. 

Lemma  A.l:  Fix  an  arrival  process  from  Alice.  Denote  by 
£”  the  arrival  times  of  jobs  from  the  attacker  and  s"  be  the 
sizes  of  these  jobs.  Let  t-f  be  the  departure  times  of  these 
jobs  if  the  scheduler  gave  priority  to  jobs  of  the  attacker. 
For  the  same  set  of  arrivals  from  Alice  and  the  attacker,  let 
£"  be  the  departure  times  of  the  jobs  of  the  attacker  if  the 
scheduler  used  a  policy  P,P  g  WC,  where  WC  is  the  class  of 
work-conserving  polices.  Then,  tt,  for  each  i  is  a  deterministic 
function  of  f",s™  and 

Proof  of  Lemma  A.l:  Let  W(u)  be  the  total  work  in  the 
system  at  time  u.  Note  that  W  (it)  is  the  same  for  all  non-idling 
policies  [9],  Denote  by  Wa{u),  and  Wb{u)  respectively,  the 
total  work  of  Alice  and  Bob  in  the  system  at  time  u  when  the 
scheduler  policy  P.  Then  W(u)  =  Wa(u)  +  Wb{u).  Denote 
by  ( x )  the  fractional  part  of  a  real  number  x,  i.e.,  ( x )  = 
x  —  [x\ . 

Claim  A.la:  The  attacker  can  compute  {W(tf))  for  each  i. 

Proof  of  Claim  A.la:  When  the  scheduler  uses  policy  P, 
suppose  there  are  m  outstanding  jobs  from  the  attacker  which 
have  not  departed  by  time  f.  Let  ji,  j2, . . . ,  jm  be  their 
indices,  i.e.,  job  j i  is  the  job  from  attacker  that  has  arrived 
by  time  t,  and  departs  first  after  time  lt,  j2  is  the  second  job 
that  departs  after  time  t,  and  so  on.  Suppose  tj,  —  ti  <  Sj ,, 
then  at  time  t, ,  the  scheduler  should  have  been  busy  serving 
a  job  from  the  attacker.  In  this  case,  Wa(u)  is  an  integer. 

~  _  m 

Therefore,  (W(U))  =  ( WB(U ))  =  (tjl  -  t »  +  J2  sjk )•  Now> 

k= 2 

suppose  tj,  —  ti  >  Sj,,  then  at  time  t„  the  scheduler  should 
have  been  busy  serving  a  job  from  Alice.  In  this  case,  the 
scheduler  has  to  first  serve  the  job  from  Alice  that  is  in  service 
at  time  f,  and  only  then  can  it  move  on  to  serving  other  jobs. 
Therefore,  job  j\  from  the  attacker  can  only  depart  at  time 
( Wa(U ))  +  q,  for  some  non-negative  integer  q.  Therefore, 

(WA(ti)}  =  ( tj ,  -  Sj,  -  ti),  and  WB{ti)  =  J2  sik-  Now, 

fc= l 


(W(ti))  =  (( Wa(U ))  +  ( Wb(U ))),  which  can  clearly  be 
computed  by  the  attacker.  § 

To  prove  the  result  of  the  lemma,  note  that  V  i, 

ti+ 1  +  (W(ti+ 1))  +  Sj+i,  if  L+i  >  £j 

(9) 

t’i  T  'Si-t-i,  if  ti- (_i  ^  ti  , 

(10) 

If  t,  <  ti+ 1,  then  the  i  +  1th  job  waits  only  for  the  service 
of  the  job  that  is  already  at  the  server,  and  then  immediately 
goes  into  service.  Therefore  equation  (9)  follows.  If  t,  >  fi+1 , 
the  i  +  1th  job  from  the  attacker  goes  into  service  as  soon  as 
the  ith  job  of  the  attacker  gets  served.  Therefore  (10)  follows. 
Also,  t1  =  t\  +  (W{t\))  +  8\.  From  Claim  A.la,  (' W(tj ))  can 
be  computed  by  the  attacker  for  each  i,  and  consequently  if. 


Proof  of  Theorem  3.1:  From  Lemma  A.L  for  any  work- 
conserving  policy  P  used  by  the  scheduler,  the  attacker  can 
always  simulate  the  observations  which  he  would  make  if  the 
scheduling  policy  were  a  priority  policy.  Denote  by  £p’r^ority 
the  estimation  error  incurred  by  the  strongest  attacker  against 
the  priority  policy.  Using  the  same  notation  from  Lemma  A.l, 
for  every  P  €  WC,  we  then  have  the  following: 


E 


{Xk--E[Xk\t?,s?,%]y 


=  E 

<  E 


(xfe-E[Xfc|^S^fVi”]) 

(xfe-E[Xfc|  4"]) 


(11) 

(12) 


where,  (11)  follows  from  Lemma  A.l,  and  (12)  follows  from 
an  elementary  result  from  estimation  theory  which  states  that 
discarding  information  leads  to  an  inferior  estimate.  Therefore, 


min  l^E[(Xfc-E[Xfe| 

N  r 

EE  (xk-E[Xk\ t^s^t^y 


N 


3  k—  1 

+  n  sn  ■  - _ — 

L1  ’*1  •  AT 


< 


-<A 


N 


t™  J  Nc  <X 


k= 1 


and  consequently,  pf  <  <Fpr|ority,  VP  G  WC.  ■ 


Appendix  B 
Proof  of  Lemma  3.2 

For  some  j,  if  i  ■  =  tj  +  e,  i.e.,  if  the  jth  job  from  the 
attacker  goes  into  service  immediately  upon  its  arrival,  then 
the  system  must  be  empty  at  time  tj.  Hence,  tj  marks  the 
start  of  a  busy  period  that  is  initiated  by  an  attacker’s  job. 
On  the  other  hand,  if  i  •  >  t7  +  e,  and  f  ■_1  <  tj  —  1  — 
e,  then,  tj  —  1  —  e  marks  the  start  of  a  busy  period  that  is 
initiated  by  a  job  from  Alice.  Because  every  busy  period  is 
initiated  by  either  a  job  from  Alice  or  the  attacker,  and  because 
the  events  that  are  described  above  occur  only  at  the  start 
of  busy  periods,  the  start  times  of  all  the  busy  periods  can 
be  computed  by  the  attacker,  and  he  can  furthermore  figure 
out  if  these  busy  periods  are  initiated  by  an  attacker’s  job  or 
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Alice’s  job.  To  compute  the  end  times  of  the  busy  periods,  note 
that  the  maximum  time  the  scheduler  stays  idle  between  two 
consecutive  busy  periods  is  less  than  1.  This  is  a  consequence 
of  the  arrival  process  from  the  attacker.  Furthermore,  note  that 
all  the  busy  periods  end  with  the  departure  of  a  job  from  the 
attacker  irrespective  of  whether  it  was  initiated  by  a  job  from 
Alice  or  the  attacker.  Therefore,  the  last  departure  before  the 
start  of  the  next  busy  period  marks  the  end  of  the  current  busy 
period. 

Appendix  C 
Proof  of  Lemma  3.3 

Recall  that  all  of  the  busy  periods  end  with  the  service  of 
a  job  from  the  attacker,  and  in  each  busy  period,  jobs  from 
the  attacker  and  Alice  are  served  alternately.  Hence,  all  the 
busy  periods  that  are  initiated  by  a  job  from  the  attacker 
are  of  duration  (k  +  l)e  +  k,  for  some  non-negative  integer 
k.  Furthermore,  preceding  the  busy  period  initiated  by  the 
attacker’s  job,  there  is  no  arrival  from  either  of  the  users  for 
a  duration  of  1  time  unit.  On  the  other  hand,  all  busy  periods 
initiated  by  a  job  from  Alice  are  of  duration  (fc+l)e+£;+l,  for 
some  non-negative  integer  k.  Preceding  such  a  busy  period  is  a 
period  of  duration  less  than  1  where  there  are  no  arrivals  from 
either  of  the  users.  Therefore,  given  the  end  times  of  the  busy 
period,  the  attacker  can  infer  the  duration  of  the  busy  period 
and  also  if  the  busy  period  was  initiated  by  an  attacker’s  job  or 
Alice’s  job.  Therefore,  if  the  Ith  busy  period  was  initiated  by 
an  attacker’s  job,  departures  in  that  busy  period  occur  at  times 
n  +  e,  n  +  1  +  2e, . . . ,  rl .  If  the  Ith  busy  period  was  instead 
initiated  by  a  job  from  Alice,  then  the  departures  in  that  busy 
period  occur  at  times  r;  + 1  +  e,  r;  +  2  +  2e, . . . ,  r; .  Therefore, 
the  departure  times  of  all  of  attacker’s  jobs  can  be  computed 
knowing  the  start  and  end  times  of  the  busy  periods.  Also, 
because  the  arrival  and  departure  times  of  the  jobs  issued  by 
the  attacker  are  related  by  (3),  the  arrival  times  of  the  jobs  can 
be  computed  as  well. 

Appendix  D 

Proofs  of  Lemmas  4.2  and  4.3 
A.  Proof  of  Lemma  4.2 

We  will  use  induction  to  prove  this  lemma.  First,  note  that 
V\(u)  =  T2\[u)  Vit  G  (0, 1 1 ) .  This  is  because  the  two  systems 
have  the  same  arrivals  till  then.  At  time  4,  the  scheduler  is 
either  busy  serving  a  job  from  Alice,  or  is  idle.  If  it  is  busy, 
then  the  scheduler  waits  till  it  completes  the  service  of  this  job, 
and  then  switches  over  to  serve  Bob.  In  either  case,  Bob’s 
incoming  job  goes  into  service  at  time  4  =  inf{w  >  4  : 
V\ (u)  =  and  the  job  departs  the  system  at  time 

4  =  ft  +  si.  Therefore,  T>\{u)  =  V\(u),\/u  €  (0,4)  and 
because  Alice  does  not  get  any  service  when  the  scheduler 
serves  Bob,  V\{u)  =  V\(ti),\/u  £  (4,4)- 

Statement  of  the  induction:  Given  arrival  times  of  the  first 
k  jobs  from  Bob,  ,  their  sizes  sj,  departure  times  of  the  first 
k  —  1  of  his  jobs,  4* 'k~1\  T>\,  and  V\(u),  Vu  £  (0,  tk_1),  the 
departure  time  of  the  kth  job,  tk,  and  T>\(u),\/u  £  (0 ,tk)  can 
be  computed. 


Proof  of  induction:  The  base  case  for  induction  is  already 
proved.  We  need  to  prove  it  for  some  k  >  1  assuming  it  is 
true  for  all  times  before  that.  The  arrival  time  of  the  kth  job 
from  Bob  falls  into  one  of  the  following  cases: 

Case  1:  tk  <  tk_1.  Note  that  T>\{u)  <  T>\(u),\/u.  This 
is  because  in  the  second  system,  there  are  jobs  from  the 
attacker  along  with  the  jobs  from  Alice.  Also,  2>a(4-i)  =  0, 
where  tk_l  is  an  infinitesimal  small  time  before  tk_1.  There¬ 
fore,  if  T>\[tk_ t)  =  T>\{tk_ r),  then  it  must  be  the  case 
that  V\(tk_ i)  =  0,  implying  that  all  of  Alice’s  jobs  that 
arrived  before  tk_1  have  been  served  by  then.  Therefore, 
the  kth  job  from  Bob  goes  into  service  immediately  and 
departs  the  system  at  time  tk  =  tk_1  +  sk.  In  this  case, 

*>!(«)  =  ^a(*L-i)Vu  e  (4_l5  4). 

If  V\(tk_ i)  <  T>\{tk_l),  at  time  tk_v  there  is  at  least 
one  unserved  job  from  Alice  in  the  system,  which  goes  into 
service  at  time  tk_1.  Therefore,  T>\{u)  =  rD\(tk_1)  +  u  — 

4-i>Vu  £  (t'k-nh-t  +  !)>  and  Va(u)  =  VKt'k-^’Vu  £ 

(4-i  + 1>  4-i  + 1  +  sk )•  4  =  4-i  + 1  +  Sfe- 

Case  2:  tk  >  4-i-  I*1  this  case,  after  serving  the  k  —  1th 
job  from  Bob,  the  scheduler  switches  over  to  Alice  and  serves 
her  jobs  back  to  back  (if  there  are  jobs  to  be  served)  till  the 
kth  job  from  Bob  arrives.  Therefore  Vit  £  (4-i>4)> 

V\{u)  =  min{D\{tk_1)  +  u  -  4_x,  V\{u)}. 

The  time  when  the  kth  job  from  Bob  goes  into  service  is  given 
by  tk  =  inf  {it  >  tk  :  T)\  (tk)  +  u  -  tk  =  \V\(tk)]}.  Then, 
V\(v)  =  V\{tk)  +  u  —  4-, Vm  G  (4,4),  4  =  4  +  sk,  and 
Vl(u)  =  (4),  Vit  G  (4,4). 

B.  Proof  of  Lemma  4.3 

The  available  information  to  the  attacker  when  he  issues  his 
second  job  is  no  more  than  the  time  when  he  issued  his  first 
job,  its  size,  and  its  departure  time.  Therefore,  by  the  result 
of  the  Lemma  4.2,  4  is  a  function  of  4,  si  and  T>\.  By  a 
similar  argument,  4  is  dependent  at  most  on  4, 4,  Si,  S2, 4 
and  f2,  all  of  which  are  just  a  function  of  4,  Si  and  T>\,  and  so 
on.  Before  issuing  his  first  job,  the  attacker  has  no  information 
about  Alice’s  arrivals.  Hence  4  and  si  are  independent  of  any 
function  of  the  arrival  times  of  Alice’s  jobs,  in  particular,  Xk. 
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Abstract — This  paper  presents  the  design  and  implementation 
of  spyware  communication  circuits  built  into  the  widely  used  Car¬ 
rier  Sense  Multiple  Access  with  collision  avoidance  (CSMA/CA) 
protocol.  The  spyware  components  are  embedded  within  the 
sequential  and  combinational  communication  circuit  structure 
during  synthesis,  rendering  the  distinction  or  dissociation  of  the 
spyware  from  the  original  circuit  impossible.  We  take  advantage 
of  the  timing  channel  resulting  from  transmission  of  packets  to 
implement  a  new  practical  coding  scheme  that  covertly  transfers 
the  spied  data.  Our  codes  are  robust  against  the  CSMA/CA’s 
random  retransmission  time  for  collision  avoidance  and  in  fact 
take  advantage  of  it  to  disguise  the  covert  communication.  The 
data  snooping  may  be  sporadically  triggered,  either  externally 
or  internally.  The  occasional  trigger  and  the  real-time  traffic’s 
variability  make  the  spyware  timing  covert  channel  detection  a 
challenge.  The  spyware  is  implemented  and  tested  on  a  widely 
used  open-source  wireless  CSMA/CA  radio  platform.  We  identify 
the  following  performance  metrics  and  evaluate  them  on  our 
architecture:  1)  efficiency  of  implementation  of  the  encoder; 
2)  robustness  of  the  communication  scheme  to  heterogeneous 
CSMA/CA  effects;  and  3)  difficulty  of  covert  channel  detection. 
We  evaluate  criterion  1)  completely  theoretically.  Criterion  2)  is 
evaluated  by  simulating  a  wireless  CSMA/CA  architecture  and 
testing  the  robustness  of  the  decoder  in  different  heterogeneous 
wireless  conditions.  Criterion  3)  is  confirmed  experimentally  us¬ 
ing  the  state-of-the-art  covert  timing  channel  detection  methods. 

I.  Introduction 

Rapid  technological  advances  in  wireless  communication 
and  the  growth  in  the  number  and  diversity  of  applications 
and  services  raise  the  demand  for  data-integrity  and  commu¬ 
nication  security.  The  complexity  of  the  emerging  applications 
and  platforms  introduce  newer  vulnerabilities  that  need  to  be 
carefully  identified,  analyzed,  and  addressed. 

The  emerging  802. xx  standard  protocols  must  be  continually 
enriched  with  security  features  to  meet  the  new  and  more 
critical  application  demands.  For  instance,  the  IEEE  802.22  is 
a  cognitive  radio  standard  being  developed  to  bring  broadband 
access  to  less  populated  rural  areas  by  using  vacant  TV  chan¬ 
nels.  To  address  the  complex  adaptive  nature  required  by  the 
technology,  a  cognitive  radio  device  is  typically  implemented 

This  work  was  supported  in  part  by  AFOSR  under  Grants  FA9550-11-1- 
0016,  727  AF  Sub  TX  0200-07UI,  and  FA9550-10-1-0573,  and  NSF  CCF 
10-54937  CAR  and  CCF  10-65022. 


by  a  general  purpose  computer  processor  that  also  runs  radio 
application  software.  As  a  result,  it  is  susceptible  to  spyware 
and  malicious  software  or  hardware. 

In  this  paper,  we  present  a  new  transparent  and  robust  covert 
communication  scheme  that  can  reliably  spy  the  chip  data  to 
outside  entities  by  exploiting  the  timing  channel  resulting  from 
inter-arrival  times  of  the  legitimate  transmitted  packets.  Our 
scheme  can  be  embedded  within  any  medium  access  control 
(MAC)  protocol  that  avoids  collision  in  packet  retransmissions 
by  using  an  exponential  back-off  rule.  For  instance,  the  IEEE 
802.11  Wireless  Local  Area  Network  (WLAN)  standard’s 
main  MAC  layer  protocol.  Carrier  Sense  Multiple  Access  with 
Collision  Avoidance  (CSMA/CA),  is  an  example  of  a  widely 
used  protocol  that  employs  such  a  rule. 

The  effects  of  the  CSMA/CA’s  back-off  rule  perturbs  packet 
timings  by  virtue  of  the  queuing  and  hence  introduces  a  non¬ 
standard  noisy  channel  that  interferes  with  timing-based  com¬ 
munication  purposes.  Although  the  maximum  rate  of  informa¬ 
tion  exchange  of  such  channels  (i.e.,  capacity)  was  character¬ 
ized  in  [1],  only  recently  have  practical  encoding/decoding 
methods  to  instantiate  the  theoretical  limit  with  low  (i.e., 
linear)  complexity  been  introduced  [2],  However,  to  date, 
there  have  been  no  practical  covert  information  encoding 
methods  with  low  encoder/decoder  complexity  to  instantiate 
the  findings  of  [1],  Here,  we  establish  the  first  reliable  covert 
timing  channel  that  is  also  robust  to  the  effects  of  CSMA/CA’s 
back-off  rule. 

We  identify  the  following  performance  metrics  and  evaluate 
them  on  our  architecture: 

1)  Transparency  of  the  spyware  circuitry:  This  criterion 
pertains  to  the  efficiency  of  implementation  at  the  en¬ 
coder  to  prevent  the  detection  of  the  spyware  circuitry. 

2)  Robustness:  This  criterion  tests  the  robustness  of  the 
communication  scheme  to  heterogeneous  CSMA/CA 
effects  which  act  as  noise. 

3)  Transparency  of  the  covert  channel:  This  criterion  mea¬ 
sures  the  difficulty  of  the  covert  channel  detection. 

We  evaluate  criterion  1)  completely  theoretically.  Criterion 
2)  is  evaluated  by  simulating  a  wireless  CSMA/CA  archi¬ 
tecture  and  testing  the  robustness  of  the  decoder  in  different 
heterogeneous  wireless  conditions.  Criterion  3)  is  confirmed 
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experimentally  using  state-of-the-art  covert  timing  channel 
detection  methods. 

The  main  contributions  of  this  paper  are  as  follows: 

•  We  present  the  first  design  of  spyware  integrated  within 
the  communication  circuitry  of  wireless  CSMA/CA  that 
exploits  the  timing  channel  resulting  from  inter-arrival 
times  of  packets  to  leak  data  from  the  chip. 

•  We  introduce  a  covert  channel  encoding  that  that  utilizes 
the  coding  methodology  developed  in  [2].  This  method¬ 
ology  allows  us  to  encode  timing  information  in  presence 
of  the  worst  non-negative  additive  noise,  the  exponential 
noise.  [3]. 

•  Our  spyware  employs  a  low-complexity  error-correcting 
code  framework  that  is  robust  to  timing  perturbations  re¬ 
sulting  from  the  CSMA/CA’s  collision  avoidance  back-off 
strategy.  This  back-off  strategy  introduces  a  non-standard 
queuing  noisy  channel  that  interferes  with  timing-based 
communication  purposes. 

•  We  show  the  difficulty  of  the  spyware  detection,  both  at 
the  hardware  level  and  at  the  packet  timing  level,  from 
both  theoretical  and  practical  perspectives. 

•  To  substantiate  our  theoretical  findings  and  to  evaluate 
the  overhead,  the  spyware  is  implemented  and  tested  on 
the  widely  used  WARP  wireless  radio  platform  [4]. 

The  remainder  of  the  paper  is  organized  as  follows.  The 
related  literature  is  surveyed  in  Section  II.  The  preliminaries 
are  described  in  Section  III.  Section  IV  introduces  our  theo¬ 
retical  results  for  covert  channel  embedding.  In  Section  V,  we 
devise  the  new  spyware  embedding  algorithm.  Detection  of  the 
spyware  in  circuitry  and  at  the  communication  channel  is  pre¬ 
sented  in  Section  VI.  Comprehensive  experimental  results  for 
implementation,  simulations,  and  detection  are  demonstrated 
in  Section  VII.  We  conclude  in  Section  VIII. 

II.  Related  Work 

Research  in  hardware  malware  has  recently  emerged,  espe¬ 
cially  after  the  DoD’s  Defense  Science  Board  comprehensive 
report  on  the  subject  in  2005  [5].  Much  of  this  work  has 
concentrated  on  detection  of  foundry-inserted  Trojans  (mal¬ 
ware),  which  is  typically  done  by  invasive  or  noninvasive 
study  of  the  post-silicon  ICs  and  comparison  with  the  original 
design  files  to  detect  the  modifications  to  the  manufactured 
chips  [6].  Note  that  the  existing  hardware  Trojan  detection 
mechanisms  are  not  able  to  detect  our  spyware  because  our 
assumptions  are  radically  different  (we  will  elaborate  on  this 
issue  in  Section  VI).  A  number  of  methods  for  protection  of 
ICs  and  third-party  IP  cores  against  piracy  have  been  invented 
[7].  The  threat  of  insertion  of  spyware  in  communication  and 
security  hardware  modules  has  been  identified  [8],  [9],  King 
et  al.  [10]  have  designed  and  implemented  malicious  hardware 
that  can  get  hidden  login  access  or  can  steal  the  security  keys 
on  a  microprocessor.  The  malicious  hardware  is  embedded  at 
the  micro-architecture  level  by  addition  of  a  block  of  gates  and 
does  not  modify  the  internals  of  the  architectural  blocks.  The 
key  advantage  of  our  pre-synthesis  spyware  integration  is  that 
it  is  not  distinguishable/removable  from  the  radio’s  internal 
structure  implementing  the  protocol’s  state-transitions. 


Historically,  timing  channels  are  synonymous  with  covert 
channels  [1 1] — [15].  Covert  channels  are  mechanisms  for  com¬ 
municating  information  in  ways  that  are  difficult  to  detect. 
While  the  focus  of  earlier  work  has  been  mainly  on  disrupting 
or  completely  eliminating  covert  timing  channels  [13],  [16], 
the  recent  work  has  focused  more  on  detection  of  covert 
timing  channels  [17] — [19].  Moreover,  timing  side  channels 
pose  formidable  security  and  privacy  threats  [20]-[22], 

III.  Background 

In  this  section,  we  first  provide  a  brief  background  on  the 
employment  of  the  CSMA/CA  protocol  in  the  widely  used 
IEEE  Wireless  Standard  802.11.  Afterwards,  we  summarize 
the  methods  for  triggering  the  spyware.  Lastly,  the  incentives 
for  designer  spyware  insertion  are  discussed. 

A.  CSMA/CA  in  IEEE  802.11 

The  MAC  layer  of  the  IEEE  Standard  802.11  for  Wireless 
Local  Access  Networks  (WLANs)  uses  CSMA/CA  for  colli¬ 
sion  avoidance  [23]. 

The  main  idea  behind  the  CSMA/CA  collision  avoidance 
strategy  is  that  the  channel  is  always  sensed  before  any 
transmission.  If  the  channel  is  sensed  busy,  the  sender  waits 
until  the  channel  is  idle  and  then  goes  through  a  random  back¬ 
off  period  before  retrying.  The  random  back-off  period  ensures 
fairness  among  contending  transmissions.  More  precisely,  the 
transmission  occurs  only  if,  during  a  fixed  period  of  time, 
equal  to  a  distributed  interface  state  (DIFS),  the  channel  is 
sensed  idle.  If  the  sensed  channel  is  busy  during  or  imme¬ 
diately  after  the  DIFS,  the  station  continues  monitoring  the 
channel  until  the  channel  is  sensed  idle  for  a  DIFS  period. 
At  this  time,  the  CSMA/CA  generates  a  random  back-off 
interval  before  transmitting  to  avoid  the  possible  collisions 
by  minimizing  the  probability  of  collision  with  packets  trans¬ 
mitted  by  other  radios.  Even  if  the  medium  is  sensed  idle 
during  the  DIFS  period,  a  radio  waits  a  random  back-off  time 
between  consecutive  transmission  of  two  packets  to  avoid 
channel  capture.  The  flowchart  in  Figure  1  shows  the  basic 
steps  of  the  CSMA/CA  at  the  transmitting  radio. 


Fig.  1 .  Flow  of  the  CSMA/CA  packet  transmission  with  exponential  back-off 
time. 

Moreover  for  each  packet  transmission,  an  exponential 
back-off  method  is  used  and  to  signal  a  successful  packet 
transmission,  an  ACK  is  transmitted  by  the  destination  radio. 
This  ACK  is  automatically  transmitted  at  the  end  of  the 
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Fig.  2.  Covert  communication:  A  warden  inspects  the  communication,  but  is  unable  to  decode  messages  modulated  by  the  packet  timings. 


packet,  after  a  short-time  inter-frame  space  (SIFS).  Since  the 
combination  of  the  SIFS  and  channel  delay  is  shorter  than 
DIFS,  the  channel  cannot  be  detected  idle  for  a  DIFS  until 
the  end  of  an  ACK. 


B.  Triggering  the  Spyware 

The  spyware  is  activated  upon  the  arrival  of  a  trigger, 
which  may  come  from  either  internal  or  external  sources.  The 
spyware  does  not  need  to  be  active  all  the  time.  We  call  the 
active  duration  of  a  spyware  the  spying  interval.  An  internal 
spyware  trigger  comes  from  within  the  hardware.  The  two 
most  obvious  choices  for  an  internal  incitement  are  the  states 
of  the  registers  in  the  design,  and  the  clock.  The  states  of  the 
register  can  be  utilized  in  a  number  of  ways  -  for  example, 
by  arriving  at  a  certain  internal  state  of  the  communication 
controller,  or  upon  reaching  a  certain  counter  state.  Recent 
research  has  suggested  the  possibility  of  designing  challenging 
(hard-to-detect)  Trojans  and  triggers  [24].  Since  we  implement 
the  trigger  at  the  hardware  level  and  integrate  it  within  the 
design,  it  is  impossible  to  detect  these  types  of  signals  by 
studying  the  radio  output. 

The  external  triggers  have  to  find  a  way  for  the  outside 
source  to  reach  the  spyware.  The  most  common  method  is  via 
the  communication  channel.  Note  that  other  hardware  inter¬ 
faces  to  the  outer  world,  such  as  sensors  or  power  lines,  can  be 
used  as  extrinsic  stimuli.  The  covert  communication  channel 
can  also  be  used  for  signaling  the  trigger.  For  example,  the 
radio  can  be  intentionally  kept  busy  for  certain  time  intervals. 
An  important  observation  is  that  the  external  trigger  does 
not  have  to  be  sent  via  the  covert  channel.  For  example,  the 
receiver  can  be  manipulated  such  that  a  certain  combination 
of  the  input  signals  would  not  create  an  interrupt  to  the  layers 
above.  At  the  same  time,  the  carrier  sense  signal  can  be  raised 
high  by  the  spyware,  disallowing  the  legitimate  messages  to 
route  via  the  communication  channel.  Unless  the  secret  trigger 
combination  is  known  and  the  radio  is  under  physical  tests 
observing  its  activity,  detecting  this  type  of  external  trigger 
is  extremely  difficult,  if  not  impossible.  Note  that  the  cost  of 
triggering  can  be  negligible.  For  example,  very  often  a  wire 
or  a  single  gate  is  used  for  triggering  [25]. 


C.  Incentives  for  Designer  Spyware 


Application  Specific  Integrated  Circuit  (ASIC)  is  the  pre¬ 
dominant  technology  used  for  designing  power-efficient  and 
low-cost  communication  circuitry.  At  least  three  parties  are 
involved  in  a  typical  ASIC  supply  chain  scenario:  a  design 
company,  a  third  party  fabrication  house,  and  the  end  user. 
The  spyware  (Trojan)  discussed  in  this  paper  is  inserted  by 
the  design  company  during  the  circuit  synthesis  which  could 
be  applied  to  both  ASIC  and  Field-Programmable  Logic  Array 
(FPGA)  technologies.  Note  that  the  inclusion  of  designer  spy- 
ware  is  not  limited  to  the  untrusted  companies  for  malicious 
purposes.  Contemporary  designers  often  use  the  backdoors 
implemented  by  the  spyware  for  digital  rights  management 
purposes. 

Several  research  and  development  efforts  in  the  area  of 
hardware  security  have  focused  on  the  problem  of  hardware 
Trojan  where  the  treat  model  has  been  post-synthesis  Trojan 
insertion  by  the  ASIC  fabrication  house  [25],  [26],  Since  post¬ 
synthesis  modification  of  the  design  blueprint  is  prohibitively 
expensive,  the  fabrication  house  could  only  insert  the  added 
Trojan  components  within  the  white  spaces  on  the  blueprint; 
the  white  spaces  are  unavoidable  since  they  are  a  side-product 
of  automatic  synthesis  and  layout  processes.  This  attack  could 
have  the  same  impact  on  the  communication  packets  as  the 
designer’s  spyware.  However,  this  attack  may  be  detectable 
by  the  design  company  who  has  access  to  the  original  design 
specifications  post-synthesis.  The  designer  could  use  the  side- 
channel  tests  to  compare  the  measurements  from  the  fabricated 
chip  with  the  original  design  simulations  [27]. 

Side-channel  tests  are  inapplicable  for  the  designer  spyware 
detection  discussed  in  this  paper.  Our  spyware  is  inserted 
during  the  synthesis,  while  the  Trojan-free  pre-synthesis  spec¬ 
ifications  are  never  shared  with  other  parties  in  the  IC  supply 
chain.  The  spyware  and  its  trigger  can  be  designed  to  be  point 
functions,  and  thus  they  can  be  efficiently  obfuscated  (hidden) 
within  the  exponential  state-space  of  the  design  [7],  [24], 
[28],  As  long  as  the  circuit  performs  the  intended  input/output 
tasks,  the  hidden  functionality  of  the  spyware  cannot  be 
distinguished  post-synthesis. 
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Fig.  3.  Bits  are  encoded  into  the  arrival  times  of  the  packet  sequence  originating  from  the  user’s  packet  stream.  The  packets  are  sent  over  the  wireless 
medium  with  the  augmented  inter- arrival  times. 


IV.  Covert  Communication  with  Timings 

Historically,  timing  channels  are  synonymous  with  covert 
channels  [1 1] — [15],  [17],  [18],  [29].  Covert  channels  are 
mechanisms  for  communicating  information  in  ways  that  are 
difficult  to  detect.  Packet  networks  are  designed  with  the  goal 
of  communicating  through  packet  contents  and  their  headers 
and  not  the  timings.  Hence,  the  timing  channel  induced  by  the 
inter  packet  timings  provides  a  side  channel  that  can  be  utilized 
for  covert  communications.  Figure  2  illustrates  a  covert  timing 
channel  between  an  infected  transmitter  and  a  legitimate 
receiver.  The  warden  (a.k.a  eavesdropper)  sees  the  exchange  of 
packets  but  fails  to  realize  the  covert  communication  in  packet 
timings.  On  the  other  hand,  the  spyware  receiver,  which  has 
side  information  pertaining  to  the  parameters  of  the  encoding 
scheme,  (see  Section  IV-B)  can  decode  the  message  conveyed 
through  timings. 

As  mentioned  in  Section  III,  the  CSMA/CA  protocol  results 
in  random  delays  in  the  packet  inter-arrival  times.  We  will 
model  the  impact  of  the  CSMA/CA  as  a  first-come  first-served 
(FCFS)  queuing  system.  In  particular,  a  special  case  of  a 
FCFS  queuing  timing  channel  is  the  “exponential  server  timing 
channel  (ESTC)”,  where  service  times  are  independent  and 
identically  distributed  (i.i.d.),  and  memoryless:  in  continuous¬ 
time,  they  are  exponentially  distributed  of  rate  //  -  with 
probability  density  function  (PDF)  Psi{s)  =  /re_/is,  and  in 
discrete-time  they,  are  geometrically  distributed  of  rate  //  - 
with  probability  mass  function  (PMF)  Pst{s)  =  /r(l  —  /r)s_1. 
In  either  case,  by  defining  a,  ( di )  to  be  the  time  of  the  /'th 
packet  arrival  (departure)  of  the  queue,  we  have  that: 

jniLi^OO  if  di  >  cii,  Vi  £  {1, ..  .,n 

1 0  otherwise 


C(  A)  = 


Alog2  j, 


A  <  p,  bits/s 


(2) 


P(dn\an)  = 


where  the  maximum  of  (2)  is  achieved  with  an  input  Poisson 
process  at  a  rate  A*  =  e_1/i.  However  it  was  not  clear 
until  recently  [2]  how  to  develop  practical  codes  that  can 
withstand  the  noise  from  the  queueing  channels  and  approach 
the  fundamental  limits  given  by  (2).  In  this  work,  we  will  for 
the  first  time  use  similar  code  constructions  for  the  purpose  of 
covert  communication  over  a  timing  channel  introduced  due 
to  queuing  effects  of  CSMA/CA. 

A.  Encoder  Structure 

The  coding  schemes  of  [2]  involves  an  algebraic  code 
construction  that  allows  for  a  simple  encoder,  thus  lending 
itself  nicely  to  the  design  of  an  in-circuit  spyware.  In  short, 
a  coset  code  construction  [31]  followed  by  shaping  is  used 
[31] — [34].  Specifically,  we  first  use  a  sparse  graph  coset  code 
("•«): 

C  =  {x  :  Hx  =  s} 

where  H  is  an  m  by  n  matrix,  s  is  a  length-m  vector,  and  all 
algebraic  operations  are  defined  over  the  finite  field  Fq.  The 
encoder  module  uses  the  generator-representation  of  the  code, 
where  a  length-k  (k  =  n  —  m )  information  vector  is  mapped  to 
a  length-n  codeword  vector  x  by  use  of  an  n  by  k  generator 
matrix  G  and  a  length-n  vector  p: 

x  =  Gu  +  p. 

Thus,  in  order  to  perform  these  operations,  only  standard 
^ Infinite  field  arithmetic  is  required.  Lastly,  a  “shaping”  operation 


is  performed  to  map  each  xt  £  Fq  to  an  inter-arrival  time 


Si  =  di  -  max(ai,di_i) 


(lb) 


Zi  £ 


using  a  table-lookup: 


Ideally,  in  the  absence  of  any  queuing,  the  information 
in  the  timings  between  successive  packets  could  be  used 
for  unlimited  rates  of  communication.  However,  by  adding 
random  delays  to  the  inter-arrival  times  of  the  packets,  the 
CSMA/CA  protocol  acts  as  “noise”  to  the  packet  timings,  as 
illustrated  in  Figure  3.  It  is  the  job  of  the  spyware  receiver 
to  decode  the  transmitted  message  by  virtue  of  the  departure 
process  of  the  queue.  The  ESTC  has  the  smallest  capacity 
among  all  FCFS  queuing  channels  with  the  same  average 
service  rate  [30],  hence  a  covert  communication  scheme  that 
can  survive  the  ESTC  is  most  likely  robust  to  other  types  of 
queuing  noise. 

The  maximum  rate  of  communication  for  an  ESTC  with  an 
arrival  process  of  rate  A  satisfies  [1] 


—  Pi  (Xi). 


This  idea  exploits  the  well-known  result  [31]  that  the  ensemble 
of  random  linear  coset  codes  produces  i.i.d.  code  symbols 
Xi  uniformly  distributed  over  Fq.  So  II,  is  chosen  to  map 
a  uniformly  distributed  random  variable,  xt  £  Fq,  to  a  non- 
uniform,  exponentially  distributed,  inter-arrival  time  Zi  £  R+. 
This  can  be  done  via  the  use  of  a  “dither”  Di,  used  widely 
in  quantization  for  provably  good  performance  [35],  and  the 
inverse  CDF  F%  (•)  of  an  exponentially  distributed  random 
variable: 


Ui(x) 


Bi(x) 


K{x) 

~Q~ 


+  Di 


Fz\Ui{x)) 


mod  1 


(3a) 

(3b) 
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where  TZ(x)  denotes  interpreting  x  G  Fq  as  a  member  of 
M+.  Note  that  the  details  of  the  dither  are  immaterial  to  the 
encoder/decoder  pair:  Bi  is  simply  a  length-Q  vector. 


B.  Decoder  Structure 


Fig.  4.  A  block  diagram  of  the  spyware  encoder  and  decoder.  The  spyware 
receiver  decodes  the  message  using  the  packet’s  timing  sequence  (red). 

The  coding  scheme  of  [2]  exploits  the  graphical  structure  of 
the  probabilistic  dynamics  of  a  queuing  system  (la)  that  en¬ 
ables  a  low-complexity  message-passing  decoder.  The  decoder 
module  resides  in  software  at  the  spyware  receiver,  and  upon 
receiving  the  packets,  it  decodes  the  transmitted  message  using 
the  inter-packet  times  of  the  departure  process  (as  depicted  in 
Figure  4). 

In  this  scheme,  the  concatenation  of  the  graphical  model  of 
the  error-correcting  code  (specified  by  the  sparse  matrix  H, 
the  syndrome  s,  the  shaping  operator  Bi),  and  the  probabilistic 
dynamics  of  a  queuing  system  lead  to  an  aggregate  graphical 
model  that  is  highly  structured.  This  enables  the  use  of 
standard  low-complexity  message -passing  algorithms  (belief 
propagation,  also  termed  the  sum-product  algorithm)  as  a 
viable  decoding  solution  that  produces  low  bit-error  rates  even 
close  to  the  capacity. 

V.  Implementation 

The  CSMA/CA  which  constitutes  the  communications 
MAC  layer  may  be  implemented  in  software  or  hardware. 
Either  way.  The  MAC  needs  to  interact  with  the  physical 
layer  (PHY)  and  the  hardware  components  that  provide  the 
necessary  information  for  the  correct  operation  of  the  protocol. 
The  PHY  layer  can  be  modified  without  affecting  the  upper 
layer  MAC  or  notifying  the  changes  to  any  of  the  higher 
layers,  or  even  providing  false  information  to  them.  Figure  5 
shows  different  ways  of  interaction  between  the  MAC  and 
the  PHY  layer.  The  MAC  queries  the  PHY  for  the  status 
of  the  medium  and  timing  information.  As  can  be  seen  on 
the  figure,  the  PHY  can  be  modified  to  mislead  the  MAC 
by  sending  false  timing  information  to  it,  or  even  notify  the 
MAC  that  the  medium  is  busy  when  it  is  idle.  Perhaps  the 
most  important  mean  for  spying  the  information  at  this  layer 
is  that  the  front  transmission  and  reception  of  the  packets  must 
be  done  through  the  PHY. 

In  our  implementation  of  the  spyware,  we  modify  the  PHY 
so  that  the  timing  of  the  sent  packets  are  altered  in  order 
to  covertly  send  the  stolen  data  to  the  outside  world.  To 
achieve  this  goal,  we  implement  the  encoder  explained  in 
Section  IV-A.  The  block  diagram  of  the  encoder  circuit  is 
shown  in  Figure  6.  The  encoder  circuit  was  implemented  on  a 
reconfigurable  radio  platform  that  implements  a  programmable 
baseband  architecture.  We  used  the  WARP  platform  devel¬ 
oped  at  Rice  University  for  our  implementation  [4],  This 
programmable  baseband  radio  contains  an  FPGA  board  that 


Fig.  5.  Interaction  between  CSMA/CA  and  the  PHY. 

can  be  reconfigured.  An  advantage  of  the  FPGA  implemented 
in  WARP  (compared  to  the  simple  FPGAs  with  only  lookup 
tables)  is  that  there  it  also  contains  dedicated  adders  and 
multipliers  that  facilitate  implementation  of  various  DSP  and 
coding  algorithms. 

By  programming  the  FPGA,  both  the  generator  matrix  G 
and  the  length-n  vector  p  are  hard  coded  in  the  PHY  layer.  G  is 
generated  for  a  specific  finite  field  Fq.  Algorithm  1  shows  the 
steps  for  constructing  the  circuit  representing  multiplications 
by  G.  First,  a  template  table  T  is  constructed  for  Fq.  T  has 
Q  entries  representing  small  circuits  that  perform  addition  and 
multiplication  of  constants  from  0  to  Q  —  1  with  a  variable  in 
Fq.  The  circuits  in  the  template  table  are  simple  XOR  gates, 
constants,  or  wires.  The  circuit  is  generated  by  the  loop  in 
Fines  5  to  7  in  the  algorithm. 


Fig.  6.  Block  diagram  of  the  encoder  circuit. 

The  final  circuit  is  a  network  of  XOR  gates.  The  output 
of  this  circuit  is  then  added  to  the  constant  vector  p.  This 
step  is  also  hard  coded  in  the  circuit  by  inverting  some  of 
the  bits  of  w  to  generate  x.  Each  element  in  x  represents 
an  index  for  a  specific  inter-arrival  time  value  stored  in  the 
FUT.  The  elements  of  x  are  traversed  using  a  counter.  The 
counter  is  reset  when  the  trigger  is  raised.  Every  time  a  packet 
is  successfully  transmitted,  the  counter  is  incremented  for 
generating  the  next  inter-arrival  time. 

VI.  Spyware  Detection 

The  spyware  can  be  distinguished  either  by  realizing  the 
presence  of  the  spyware  circuitry  on  the  chip  or  by  detecting 
the  covert  timing  channel.  We  next  show  that  the  neither  of 
these  two  measures  are  feasible. 

A.  Detection  of  Spyware  Circuitry 

It  should  be  noted  that  the  circuitry  of  Section  V  is  in¬ 
terwoven  and  integrated  within  the  physical  layer  functions. 
A  High  Fevel  Description  (HDF)  Hardware  language  is  used 
for  programming  the  boards.  The  HDF  code  is  then  compiled 
and  synthesized  to  hardware  components  and  automatically 
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Algorithm  1  Algorithm  for  Encoding  with  the  Generator 

matrix  G _ 

Require:  Q:  the  field  size  to  be  used,  G:  n  x  k  matrix  over 
Fq,  T:  Q  x  Q  matrix  over  Fq. 

1:  Input:  u,  the  input  vector  to  the  circuit  of  size  k 
2:  Output:  w,  the  output  vector  of  the  circuit  of  size  n 
3:  for  i  =  1  to  n  do 
4:  W(i)= 0 

5:  for  j  =  1  to  k  do 

6:  W(i)=W(i)®T(G(i,j),u(j)) 

7:  end  for 

8:  end  for 


placed  on  the  circuit.  We  note  that  the  circuit  elements  used 
for  realizing  the  algorithm  and  blocks  shown  in  the  previous 
section  are  mostly  combinational.  There  are  very  few  sequen¬ 
tial  elements  that  use  the  LUTs,  for  example  the  counters. 
The  overhead  of  the  encoder  is  a  function  of  the  size  of  G. 
While  a  smaller  G  provides  a  smaller  overhead,  there  is  a 
tradeoff  between  the  size  of  G  and  the  accuracy  of  decoding 
of  the  encoded  message.  An  example  implementation  of  the 
encoder  and  the  tradeoff  between  the  overhead  and  accuracy 
of  the  decoder  is  shown  in  Table  I  of  Section  VII-A.  Note  that 
detection  of  spyware  by  the  power  consumption  is  not  feasible 
since  its  impact  can  be  easily  obfuscated,  since  the  spyware 
circuit  can  be  actively  consuming  power  without  sending  any 
information  out.  Therefore,  it  cannot  be  used  for  spyware 
detection  purposes.  The  specific  power  consumption  number 
is  very  dependent  on  the  implementation.  In  our  simulated 
implementation,  it  was  less  than  0.5%  when  the  trigger  was 
constantly  on.  This  number  is  below  the  noise  floor  for  the 
measurable  power  from  the  supply  pin  of  WARP  boards  and 
is  not  detectable. 

The  class  of  attacks  that  are  addressed  in  this  paper  are 
channel-based,  where  the  communication  packet  timings  can 
be  probed.  This  is  because  of  our  spyware  implementation 
scenario.  The  designer  is  the  implementer  of  the  covert  channel 
and  therefore,  others  who  are  users  would  only  utilize  the 
implementation.  We  believe  that  this  is  a  reasonable  assump¬ 
tion  for  at  least  three  reasons.  First,  for  many  products,  the 
manufacturers  may  provide  a  backdoor  to  the  design  that 
would  often  remain  unnoticed.  The  method  provided  in  this 
paper  can  be  used  as  a  possible  backdoor  for  the  implementer. 
The  reason  is  that  the  typical  users  are  not  provided  with  the 
blueprint  of  the  circuitry  to  be  able  to  investigate  the  designs 
as  those  are  often  proprietary  information  and  trade  secrets. 
There  have  been  notable  cases  where  it  was  revealed  that 
the  manufacturer  has  intentionally  implemented  a  backdoor  to 
the  design,  so  this  attack  is  realistic  and  has  been  reported. 
Second,  after  packaging,  the  internals  of  many  electronic 
products  are  clandestine  to  the  users  and  cannot  be  easily 
debugged  and  tested.  It  is  true  that  in  our  implementation, 
we  use  a  programmable  platform  to  evaluate  our  method. 
One  disadvantage  or  reconfigurability  might  be  that  it  can 
be  easily  reprogrammed  and  maybe  even  probed  to  a  certain 
degree.  However,  we  emphasize  that  in  real  products  there  is 
a  possibility  of  implementing  such  channels  in  ASIC  which 


cannot  be  reprogrammed  or  removed.  Third,  the  fabrication 
house  attack  scenarios  are  drastically  different  than  those  in 
our  paper  as  discussed  in  Section  III-C. 

Let  us  also  discuss  a  scenario  in  which  another  entity 
designs  a  circuit  with  exactly  the  same  functionality.  Even  this 
circuit  would  not  provide  a  good  comparison  point,  because 
of  the  degree  of  freedom  in  selecting  the  design  components. 
In  other  words,  there  is  no  certain  measure  for  the  expected 
power  or  timing  that  can  be  computed  for  a  given  high  level 
specification,  so  there  would  be  little  relevance  between  the 
new  entity’s  design  and  the  original  circuitry  without  the 
spyware.  Thus,  the  circuit-level  attack  is  not  applicable  to  our 
spyware  scenario. 

B.  Detection  of  Covert  Communication 

The  countermeasures  to  covert  timing  channels  come  mainly 
in  two  flavors:  detection  and  disruption.  The  detection  of 
covert  timing  channels  is  based  on  statistical  tests  that  can  dis¬ 
tinguish  between  legitimate  and  covert  traffic.  The  disruption 
(e.g.,  jamming)  of  covert  timing  channels  is  done  by  an  active 
warden  that  aims  at  preventing  the  covert  communication 
usually  with  the  undesired  side  effect  of  degraded  system 
performance.  While  the  focus  of  earlier  work  has  been  mainly 
on  disrupting  or  completely  eliminating  covert  timing  channels 
[13],  [16],  the  recent  work  has  focused  more  on  detection 
of  covert  timing  channels  [17],  [18].  This  is  mainly  because 
the  disruption  of  timing  channels  is  usually  costly  (either  in 
terms  of  resources  used  to  active  warden  or  the  degradation  of 
the  system  performance).  One  possible  disruption  is  random 
delaying  of  the  traffic,  which  reduces  the  performance  (e.g. 
by  reducing  the  capacity  or  increasing  the  error  rates)  of 
covert  timing  channels.  Notably,  our  wireless  timing  channel 
is  susceptible  to  injection  of  false  packet  events  by  an  external 
entity.  For  instance,  a  second  node  that  transmits  cover  traffic 
claiming  the  identity  of  the  node  containing  the  spyware. 
However,  such  a  countermeasure  is  both  costly  in  terms  of 
the  attacker  resources  and  the  lost  capacity  of  the  legitimate 
users  of  the  channel.  Therefore,  it  is  highly  desirable  to  employ 
disruption  mechanisms  only  after  one  has  verified  the  presence 
of  a  covert  channel  reliably.  Moreover,  often  cryptographic 
measures  can  deal  with  such  scenarios.  Specifically  in  our 
case  once  the  presence  of  spyware  deduced  covert  channels 
is  detected,  the  warden  will  eliminate  the  covert  channel  by 
eradicating  the  spyware. 

We  check  the  detection  resistance  of  our  scheme  against  the 
current  state-of-the-art  covert  timing  channel  detection  tests. 

•  Kolmogorov-Smirnov  (K-S)  Test.  The  KS-test  is  a  non- 
parametric  and  distribution  free  test  that  evaluates  the 
equality  of  one -dimensional  distributions  of  probability. 
The  objective  is  to  determine  if  two  datasets  (or  one 
dataset  and  a  reference  distribution)  differ  significantly. 
For  the  purpose  of  detection  of  covert  channels,  legiti¬ 
mate  traffic  (in  the  absence  of  covert  communication)  is 
compared  to  the  data  collected  in  a  situation  where  it 
is  suspected  that  a  covert  communication  is  happening. 
We  shall  call  the  samples  from  the  later  situation  as 
test  traffic.  If  the  test  traffic  is  sufficiently  different  from 
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the  legitimate  traffic,  it  is  declared  that  a  covert  channel 
has  been  detected.  The  KS-test  measures  the  maximum 
distance  between  the  empirical  cumulative  distribution 
functions  (CDF)  of  the  legitimate  and  the  test  traffic, 

max  \Ft(x)  -  Fl(x)\,  (4) 

X 

where  Ft(x)  and  Fl(x)  denote  the  empirical  CDFs  of 
the  test  and  legitimate  traffic. 

•  Regularity  Tests.  Regularity  tests  are  based  on  the 
assumption  that  the  covert  traffic  is  more  regular  than  the 
legitimate  traffic.  One  example  of  such  a  test  is  Cabuk 
et  al.’s  heuristic  regularity  test  [17]  which  examines 
whether  the  variance  in  the  inter-arrival  times  remains 
small.  Specifically,  the  test  traffic  is  separated  into  non¬ 
overlapping  windows  of  size  w  packets.  For  each  window 
i,  the  standard  deviation  a,  of  the  inter-arrival  times 
is  computed.  If  the  standard  deviation  of  the  pairwise 
differences  between  all  pairs  <t,;  and  a3  for  i  <  j  remains 
small,  the  traffic  is  considered  highly  regular  and  hence 
the  presence  of  a  covert  channel  is  declared. 

•  Entropy  Tests.  Entropy-based  detection  test  assumes  that 
the  presence  of  a  covert  timing  channel  has  an  effect  on 
the  entropy  of  a  legitimate  signal.  To  examine  the  entropy, 
the  inter-arrival  times  are  binned  into  Q  bins  which  are 
used  to  calculate  the  empirical  probability  densities  of 
patterns  of  size  m  in  the  sequence.  A  pattern  is  defined 
to  be  a  sequence  of  bin  numbers.  The  estimated  entropy 
of  patterns  of  size  m  is  calculated  using  these  probability 
densities.  The  regularity  of  the  data  can  be  measured  from 
the  entropy  rate.  Similarly  a  slightly  different  version 
of  test  using  the  conditional  entropy  estimates  can  be 
calculated.  For  more  details  see  [19], 

Other  special  purpose  detection  tests  that  are  designed  for 
specific  covert  timing  channels  include  Cabuk  et  al.’s  [17] 
e-similarity  for  IP  covert  timing  channels  and  Berk  et  al.’s 
[18]  mean-max  ratio  to  test  for  binary  or  multi-symbol  covert 
timing  channels.  The  mean-max  ratio  test  assumes  that  the 
legitimate  inter-packet  delays  follow  a  normal  distribution 
which  is  often  not  true  for  real  network  traffic. 

In  Section  VII,  we  show  that  none  of  the  above  tests  can 
reliably  detect  the  presence  of  covert  communication. 

VII.  Experimental  Results 

The  hardware-based  covert  channel  spyware  was  imple¬ 
mented  using  the  Wireless  Open  Access  Research  Platform 
(WARP)  [4],  which  is  a  scalable,  extensible  programmable 
wireless  platform  that  enables  advanced  wireless  network  pro¬ 
totyping.  The  WARP  board  contains  reconfigurable  hardware 
(FPGA)  and  can  be  connected  by  a  UBS  port  to  a  PC 
running  Windows.  The  encoder  is  implemented  in  hardware 
and  its  overhead  when  integrated  within  the  WARP  platform 
is  reported  for  different  encoder  sizes  in  Section  VII-A.  The 
decoder  is  implemented  in  software  using  Matlab,  since  we 
do  not  study  the  spying  receiver’s  architecture.  We  also  do 
network  level  simulations  where  the  communication  among 
the  different  nodes  introduces  interfering  network  traffic.  The 
goal  is  to  evaluate  the  accuracy  of  decoding  in  presence  of 


random  interference  as  well  as  to  study  whether  the  covert 
flow  survive  traffic  analysis.  Our  studies  require  two  types  of 
experimental  setups:  hardware  setup  and  software  simulation 
setup.  The  details  of  the  hardware  setup  are  as  follows: 

•  The  Vertix2P7  FPGA  on  the  WARP  platform  is  used 
for  the  encoder  implementation.  The  spyware  encoder  is 
implemented  in  Verilog  and  integrated  within  the  physical 
layer  blocks  on  the  FPGA. 

•  CSMA/CA  MAC  layer  protocol  is  implemented  on  the 
PowerPC  hardcore  processor  embedded  in  the  WARP 
FPGA. 

•  The  physical  layer  blocks  with  the  exception  of  the  analog 
front-end  are  all  implemented  on  the  WARP  FPGA. 

•  A  field  size  Q=  4  and  GF(4)  operations  are  used  for 
encoding. 

•  Three  different  sizes  of  the  encoding  matrix  (G)  are 
implemented:  20  x  8,  40  x  16,  and  80  x  32. 

The  details  of  the  simulation  setup  are  as  follows: 

•  Aside  from  the  spyware  encoder  and  its  intended  receiver, 
there  are  K  other  nodes  (possibly  interfering)  in  the 
wireless  network. 

•  The  transmitting  nodes  send  out  the  packets  according  to 
a  Bernoulli  process,  with  burstiness  probability  p. 

•  The  spyware  information  is  encoded  in  the  timings  of 
the  packets,  as  discussed  in  Section  IV-A,  for  a  fixed 
encoder  rate  of  -  log2  Q  =  0.4  bits/packet,  and  A  = 
0.225  packets/sec. 

•  The  collision  management  scheme  in  the  CSMA/CA 
protocol  introduces  queuing  of  the  packets  -  thus  acting 
as  a  queuing  timing  channel  -  as  discussed  in  Section  III. 

•  The  spyware  decoder  senses  the  transmission  times  of 
the  wireless  radio  -  which  act  as  the  departure  process  of 
the  queue  -  and  uses  the  iterative  probabilistic  decoder 
discussed  in  Section  IV-B. 

The  timing  diagrams  in  Figure  7  demonstrate  the  signal 
timings  resulting  from  our  WARP/FPGA  implementation  of 
the  encoding  matrix  20  x  8.  Figure  7(a)  contains  the  timing 
plot  after  initiating  the  CSMA/CA  protocol  by  a  trigger  signal, 
denoted  as  TX  Ready.  In  Figure  7(b)  we  show  the  initiation 
of  the  CTS  timeout  after  sending  an  RTS  packet.  Figure 
7(c)  illustrates  back-off  after  a  successful  frame  transmission. 
The  control  state  of  the  protocol  is  shown  by  the  bottom 
waveform  in  all  our  timing  diagrams.  As  expected,  the  spyware 
impact  is  covert  and  not  detectable  in  timing  signals.  As  we 
discussed  in  Section  I,  the  performance  of  our  spyware  is 
evaluated  with  regards  to  three  criteria:  1)  transparency  of  the 
spyware  circuitry,  2)  robustness  of  the  communication  scheme 
to  heterogeneous  CSMA/CA  effects,  and  3)  transparency  of 
the  covert  communication  scheme. 

A.  Transparency  of  the  Spyware  Circuitry 

Here,  we  study  the  transparency  of  the  spyware  circuitry 
which  is  done  by  evaluating  the  cost  of  integrating  the  hard¬ 
ware  encoder  on  the  FPGA.  The  field  size  is  set  to  4  for  the 
implementation.  A  template  table  implementing  both  addition 
and  multiplication  in  GF( 4)  is  constructed  and  a  C  program 
is  written  to  generate  the  encoder  corresponding  to  any  matrix 
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Original 

Small  (20  X  8) 

Medium  (40  x  16) 

Large  (80  x  32) 

# 

Overhead 

# 

Overhead 

# 

Overhead 

Slices 

24,518 

24.531 

0.05% 

24,643 

0.50% 

24,993 

1.94% 

Gates 

18,741,445 

18,741,598 

0.00% 

18,742,966 

0.01% 

18.747,184 

0.03% 

TABLE  I 

The  overhead  introduced  by  the  encoder  implementation  on  a  WARP  node  for  three  encoding  matrix  sizes.  The  overhead  is 

REPORTED  IN  TERMS  OF  NUMBER  OF  ADDED  SLICES/GATES  TO  THE  FPGA. 


(a) 

UGJ~LTl_rL 

I _ 


xlJx-aa 


(c) 

Fig.  7.  Timing  diagram  of  WARP/FPGA  implementation. 

G  of  arbitrary  size  in  Verilog.  The  encoder  is  inserted  within 
the  WARP  platform  physical  blocks  on  the  FPGA. 

Table  I  shows  the  overhead  introduced  by  the  encoder 
implementation  on  a  WARP  node  in  terms  of  FPGA  added 
slices  and  gates.  The  first  column  in  Table  I  is  labeled  original 
and  presents  the  original  area  of  the  wireless  node  components 
on  WARP  The  area  is  given  in  terms  of  the  number  of  slices 
used  on  the  FPGA  and  the  estimated  number  of  gates  in  the 
design.  The  next  two  columns  show  the  area  of  the  wireless 
node  on  the  FPGA  when  we  integrate  an  encoder  with  small 
matrix  G  of  size  20  x  8.  The  percentage  overhead  in  area  in 
terms  of  FPGA  slices  is  0.05%,  while  the  overhead  in  terms  of 
the  extra  number  of  gates  is  negligible.  The  next  four  columns 
show  the  area  and  the  percentage  overheads  when  using  a 
matrix  G  with  size  40  x  16,  and  80  x  32,  respectively.  Naturally, 
the  overhead  increases  as  the  size  of  the  matrix  increases.  Due 
to  the  very  small  area  overhead,  the  integration  of  the  encoder 
has  no  impact  on  the  overall  power  of  the  WARP  platform.  The 
spyware  introduced  in  this  paper  is  so  small  compared  to  the 
rest  of  the  circuitry  that  we  are  able  to  insert  it  on  non-critical 
paths  in  our  implementation.  Thus,  our  encoder  does  not  affect 
the  overall  delay  of  the  digital  circuitry  implementing  the  radio 
because  it  does  not  affect  the  critical  path  of  the  digital  design. 
The  critical  path  timing  is  what  determines  the  speed  of  the 


chip  and  is  the  timing  information  often  provided  within  the 
chip  specifications.  Again  we  emphasize  that  the  overhead 
is  only  measurable  by  the  original  designer,  and  the  original 
unaltered  files  are  never  shared  with  any  entity,  including  the 
foundry. 

B.  Robustness  of  the  Communication  Scheme  to  Heteroge¬ 
neous  CSMA/CA  Effects 

In  this  subsection,  we  study  the  robustness  of  the  spyware 
communication  in  terms  of  the  symbol  error  rate  (SER)  at 
the  decoder.  The  SER  is  affected  by  the  queuing  effects 
introduced  by  the  CSMA/CA  due  to  varying  network  condi¬ 
tions.  We  generate  multiple  levels  of  interfering  traffics  on 
the  network  by  simulating  different  number  of  nodes  that 
communicate  in  the  range  of  both  the  sender  and  receiver. 
It  should  be  noted  that  the  only  purpose  of  the  network 
traffic  is  to  introduce  interference  that  can  affect  the  times 
of  arriving  packets  carrying  the  spyware  information.  Table  II 
shows  the  average  SER  at  the  decoder.  These  average  SER 
values  are  shown  for  various  numbers  of  interfering  nodes, 
namely  values  of  K  £  {5, 10, 15},  as  well  as  burstiness 
probability  p  £  {0.4, 0.6}  for  three  encoding  matrix  sizes 
( k,n )  £  {(8, 20);  (16, 40);  (32,  80)}  corresponding  to  small, 
medium,  and  large  on  the  table  respectively. 

Although  the  SER  values  are  only  on  the  order  of  10  , 

we  bring  attention  to  the  specific  context  of  this  application: 
a  short  code-length,  externally  triggered,  and  hardware  imple¬ 
mentation.  Since  this  approach  uses  linear  coset  codes  (which 
have  extremely  small  undetected  error  probability  [36]),  an 
error  event  is  essentially  equivalent  to  the  condition  Hx  s. 
Thus,  the  spyware  receiver  can  simply  externally  retrigger  the 
transmission  at  a  random  time  in  the  future.  As  it  can  be 
expected.  Table  II  shows  that  as  the  length  of  the  code  (size 
of  the  encoding  matrix)  increases,  the  average  symbol  error 
rate  at  the  decoder  improves. 

Figure  8  illustrates  boxplots  of  the  SER  across  multiple 
trials.  The  lower  edge  of  the  rectangular  plot  corresponds  to 
e.25,  the  25th  percentile;  the  upper  edge  pertains  to  e.75,  the 
75 th  percentile.  The  50th  percentile,  or  median,  appears  as  the 
horizontal  line  between  the  lower  and  upper  bounds.  The  upper 
‘tail’  of  the  box  corresponds  to  the  vertical  line  extending 
beyond  the  75 th  percentile,  which  is  of  length  l-5(e.75  — e.25). 
All  outliers  extending  beyond  the  ‘tail’  are  explicitly  drawn 
with  the  “+’  symbol.  We  see  from  Figure  8  that  the  statistical 
structure  of  the  SER  is  highly  concentrated  near  0,  in  such 
a  way  that  cannot  be  evidenced  by  merely  the  mean  SER. 
Specifically,  the  25th  percentile  of  the  errors  is  0.  In  terms 
of  externally  triggered  retransmissions,  this  means  that  simple 
“majority-rules”  decoding  schemes  will  work  particularly  well 
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with  this  approach.  More  specifically,  while  the  average  SER 
is  of  the  order  of  10  ,  around  25%  of  the  time,  the  decoder 

receives  symbols  with  zero  error  and  thus  a  majority  decoding 
rule  after  2  or  3  transmissions  will  result  in  perfect  decoding. 


Large  Sized  Matrix,  Burstiness  Probability  =  0.4 


No.  of  Nodes 

Large  Sized  Matrix,  Burstiness  Probability  =  0.6 


No.  of  Nodes 


Fig.  8.  Boxplots  of  the  symbol  errors  across  multiple  trials,  using  the  large 
encoding  matrix  with  burstiness  probabilities  p  S  {0.4,  0.6}  with  200  runs. 


Number 
of  nodes 

Burstiness  prob.  =  0.4 

Burstiness  prob.  =  0.6 

Small 

Medium 

Large 

Small 

Medium 

Large 

5 

.16 

.08 

.08 

.13 

.09 

.05 

10 

.14 

.10 

.09 

.14 

.10 

.07 

15 

.14 

.09 

.08 

.13 

.09 

.08 

TABLE  II 

Average  SER  of  the  spyware  decoder  vs.  number  of  congesting 
nodes.  The  study  is  done  for  different  probabilities  of  sending 

PACKETS  OVER  THE  WIRELESS  MEDIUM,  AND  THE  MATRIX  SIZES. 


C.  Transparency  of  the  Covert  Communication  Scheme 
In  this  subsection,  we  study  the  detectability  of  the  covert 
information  hidden  in  the  network  traffic.  This  is  done  by 
generating  regular  traffic  and  comparing  it  to  spyware’s  covert 
traffic  for  various  number  of  interfering  nodes.  The  simulated 
traffic  was  generated  and  tested  using  Matlab  on  a  64-bit 
Six-Core  AMD  Opteron(tm)  Processor  8431.  To  test  the 
visibility  of  covert  traffic,  we  compared  two  types  of  simulated 
traffic:  (i)  Regular  traffic:  simulated  traffic  generated  using  a 
Bernoulli  distribution  with  burstiness  probability  p  with  no 
covert  message  embedded,  and  subjected  to  the  CSMA/CA 
back  off;  (ii)  Covert  traffic:  simulated  traffic  with  a  covert 
message  embedded.  This  covert  message  was  embedded  using 
the  scheme  in  section  IV-A  and  the  specifications  in  the 
beginning  of  section  VII  and  in  section  VII-C.  This  traffic  was 
then  subjected  to  the  CSMA/CA  queuing  effects  for  K  number 
of  interfering  nodes.  Inter-arrival  times  from  the  legitimate 
and  covert  traffic  were  used  as  inputs  to  three  covert  channel 
detection  tests:  Kolmogorov-Smirnov,  regularity,  and  entropy 
tests  of  Section  VI-B. 


Number 
of  packets 

Pd 

Small 

Medium 

Large 

250 

0.0357 

0.2820 

0.9022 

500 

0.0164 

0.0280 

0.3288 

1000 

0.0114 

0.0195 

0.0258 

2000 

0.0106 

0.0122 

0.0206 

TABLE  in 

Detection  ( PD )  probabilities  for  using  a  Komolgorov-Smirnov 

TEST  ON  THE  PACKET  INTER-ARRIVAL  TIMES  (FALSE  ALARM 
PROBABILITY  FIXED  AT  1%.) 


Table  III  gives  the  results  for  the  Kolmogorov-Smirnov  test 
when  a  covert  message  was  inserted  in  flows  consisting  of 
10000  packets.  In  each  of  these  flows,  we  randomly  injected 
an  external  trigger  activating  covert  transmission  consisting 
of  n  =  20  (small),  n  =  40  (medium),  or  n  =  80  (large) 
consecutive  packets  in  every  interval  consisting  of  250,  500, 
1000  or  2000  packets.  In  all  cases,  the  threshold  for  the 
hypothesis  test  was  the  deviation  of  the  Komolgorov-Smirnov 
statistic  beyond  the  99%  confidence  interval,  and  the  results 
are  an  average  of  over  10000  trials.  When  the  covert  message 
was  inserted  in  every  1000  or  2000  packets,  our  approach  was 
successfully  undetectable  -  the  detection  probability,  Pjj  and 
the  false  alarm  probability,  Pfa  (fixed  at  1%)  are  close  to 
one  another.  When  injecting  the  covert  transmission  within 
every  250  and  500  packets,  the  transmission  becomes  more 
visible,  although  the  detection  rates  remain  low  except  when 
n  =  80  over  250  packets,  in  which  case  most  of  the  covert 
transmissions  are  detected. 


Number 
of  packets 

Pd  \ 

Small 

Medium 

Large 

250 

0.315 

0.874 

0.948 

500 

0.147 

0.370 

0.885 

1000 

0.058 

0.108 

0.228 

2000 

0.029 

0.065 

0.105 

TABLE  IV 

Detection  ( PD )  probabilities  for  the  regularity  test  (window 

SIZE  W  =  100  PACKETS  AND  FALSE  ALARM  PROBABILITY  FIXED  AT  1%). 


Table  IV  illustrates  the  results  for  the  regularity  test  with 
non-overlapping  windows  of  size  w  =  100  packets  over  a 
collection  of  250,  500,  1000  or  2000  packets.  The  results 
are  averaged  over  a  1000  trials  and  the  threshold  was  set 
beyond  the  99%  confidence  interval.  The  number  of  possibly 
interfering  nodes  and  the  burstiness  probability  were  chosen 
to  be  K  =  5  and  p  =  0.4,  respectively.  Again,  we  randomly 
injected  an  external  trigger  activating  covert  transmission  of 
n  =  20  (small),  n  =  40  (medium),  or  n  =  80  (large) 
consecutive  packets.  As  mentioned  earlier  in  Section  VI-B, 
the  regularity  test  of  Cabuk  et  al.  [17]  examines  whether 
the  traffic  is  more  regular  than  it  is  expected  with  legitimate 
traffic.  Specifically,  it  tests  whether  the  variance  of  inter-arrival 
times  remains  small.  As  seen  in  Table  IV,  the  regularity  test 
completely  fails  to  detect  the  covert  communication  when  the 
covert  message  was  inserted  in  every  1000  or  2000. 

In  the  following  entropy  detection  tests,  in  accordance  with 
recommendations  of  [19],  we  used  bin  sizes  of  Q  =  5 
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and  Q  =  65,  536,  subsequently  for  the  conditional  entropy 
and  the  entropy  tests.  Tables  V  and  VI  shows  the  detection 
probabilities.  I'd  (Pfa  fixed  at  1%)  for  the  two  entropy- 
based  detection  tests  on  an  infected  traffic  consisting  of  10000 
packets,  where  the  covert  transmission  was  inserted  in  every 
interval  consisting  of  250,  500,  1000  or  2000  packets.  The 
external  trigger  activating  covert  transmission  of  n  =  20 
(small),  n  =  40  (medium),  or  n  =  80  (large)  consecutive 
packets  was  done  at  random  positions  in  each  interval.  The 
threshold  was  set  beyond  the  99%  confidence  interval  and  the 
results  are  an  average  over  1000  trials.  The  number  of  possibly 
interfering  nodes  and  the  burstiness  probability  were  chosen 
to  be  K  =  5  and  p  =  0.4,  subsequently. 


Number 
of  packets 

Pd 

Small 

Medium 

Large 

250 

0.206 

0.319 

0.419 

500 

0.144 

0.245 

0.441 

1000 

0.058 

0.082 

0.178 

2000 

0.061 

0.064 

0.090 

TABLE  V 

Detection  probabilities,  Pd  for  entropy  test  on  the  packet 

INTER-ARRIVAL  TIMES  IN  PRESENCE  OF  RANDOM  TRIGGER  (FALSE 
ALARM  PROBABILITY  FIXED  AT  1%.). 


Number 
of  packets 

Pd 

Small 

Medium 

Large 

250 

0.248 

0.646 

0.877 

500 

0.094 

0.232 

0.656 

1000 

0.043 

0.050 

0.148 

2000 

0.046 

0.056 

0.066 

TABLE  VI 

Detection  probabilities,  Pd  for  conditional  entropy  test  on 

THE  PACKET  INTER- ARRIVAL  TIMES  IN  PRESENCE  OF  RANDOM  TRIGGER 
(FALSE  ALARM  PROBABILITY  FIXED  AT  1%.). 

To  see  the  effect  of  number  of  possibly  interfering  streams 
on  the  entropy-based,  we  tested  the  detection  and  false  alarm 
probabilities  when  n  =  20  (small),  n  =  40  (medium),  or 
n  =  80  (large)  packets  containing  a  covert  transmission 
were  sent.  In  this  case,  we  did  not  embed  the  packets  in  a 
larger  transmission  sequence.  Table  VII  illustrates  the  results 
for  K  =  5, 10, 15  interfering  nodes  still  with  a  burstiness 
probability  of  p  =  0.4  for  the  conditional  entropy  case.  The 
results  for  entropy  test  were  similar  and  hence  omitted  for 
sake  of  brevity.  It  can  be  seen  that  as  number  of  interfering 
nodes  increase  the  detection  performance  of  the  conditional 
entropy  test  decreases.  One  possible  explanation  is  that  be¬ 
cause  CSMA  back  off  protocol  engages  more  often  to  ensure 
collision  avoidance,  more  packet  timings  are  altered  from  their 
regular  pattern,  resulting  in  more  randomness.  This  increase 
randomness  in  the  streams  seems  to  impair  the  detection  test. 

As  clearly  demonstrated  by  our  simulations,  none  of  the 
three  detection  tests  can  reliably  detect  presence  of  covert 
communication.  In  fact,  often  times  the  false  alarm  rates  are 
comparable,  if  not  higher,  than  the  detection  rates.  However, 
note  that  the  best  transparency  results  are  achieved  when 
smaller  number  of  covert  packets  are  inserted  more  sporadi¬ 


cally  into  the  traffic,  thus  effectively  reducing  the  transmission 
rate  of  the  covert  communication. 

VIII.  Conclusion  and  Future  Work 

We  have  shown  a  new  methodology  for  the  design  and 
implementation  of  spyware  communication  circuits  that  ex¬ 
ploits  the  CSMA/CA  MAC  protocol  properties  to  covertly  leak 
the  chip  data  to  an  intended  spyware  receiver.  Our  proposed 
scheme  is  robust  to  the  collision  avoidance  techniques  in  the 
CSMA/CA  protocol  of  the  MAC  layer.  We  interweave  the 
encoder  in  the  states  and  transitions  of  the  wireless  transmitter 
pre-synthesis,  such  that  distinction  and  uncoupling  of  the 
spyware  from  the  radio’s  physical  layer  is  impossible.  The 
spying  act  is  initiated  upon  an  occasional  external  or  internal 
trigger.  The  proof-of-concept  implementation  is  demonstrated 
on  the  widely  used  Wireless  Open  Access  Research  Platform 
(WARP).  Three  metrics  were  used  for  evaluation  of  the 
spyware  performance:  1)  efficiency  of  the  encoder  implemen¬ 
tation,  which  was  theoretically  demonstrated;  2)  robustness  of 
the  communication  method  to  CSMA/CA  effects,  which  was 
done  by  simulating  the  wireless  architecture  across  a  range  of 
parameters;  3)  difficulty  of  covert  channel  detection  that  was 
shown  to  be  resilient  against  the  known  covert  timing  chan¬ 
nels  detection  methodologies.  While  there  exists  a  trade-off 
between  accuracy,  overhead  and  detectability,  our  experimental 
results  show  that  our  implemented  spyware  simultaneously 
meets  the  desired  above  metrics. 

Some  possible  future  directions  include  detection  of  the 
“trigger”  time,  or  the  time  at  which  covert  communication 
begins.  One  may  think  about  implementing  the  spyware  in 
other  timing  aspects  of  the  wireless  network  such  as  rate 
adaptation  or  sleeping  mode.  It  can  potentially  be  done  by 
the  use  of  “change  point  detection”  approaches  [37]  where 
they  find  the  location  in  time  where  inter-arrival  times  drawn 
from  a  distribution  Fq  switch  to  a  distribution  F\. 

However,  if  the  circuit-level  design  criterion  permits  more 
sophisticated  shaping  schemes,  even  the  change-point  detec¬ 
tion  will  fail.  For  instance,  the  encoder’s  shaping  technique 
(3)  of  Section  IV  can  be  used  to  design  provably  undetectable 
covert  communication  schemes.  At  a  high  level,  the  idea  is 
that  the  dithering  scheme  discussed  in  Section  IV  has  two 
important  properties: 

•  as  we  are  using  a  random  sparse  graph  linear  coset  code, 
it  follows  from  [31]  the  x,t  symbols  are  independent  and 
identically  distributed. 

•  the  dithering  technique  in  (3)  allows  each  Z{  to  be  i.i.d. 
with  distribution  given  by  Fz{z)  used  in  the  shaping. 

It  follows  theoretically  that  any  statistical  test  that  models 
normal  traffic  as  i.i.d.  with  inter-packet  timings  of  distribution 
Fz,  will  provably  fail.  More  specifically,  the  inverse  CDF, 
Fy[  (■),  can  be  simply  shaped  to  any  desirable  legitimate 
traffic  model,  as  long  as  the  inter-packet  times  are  i.i.d. 
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Abstract  We  introduce  a  router-based  filtering  technology  aimed  to  enhance  the  se¬ 
curity  and  performance  of  cloud  computing.  When  this  technology  is  integrated  with 
cloud  auditing  methods,  it  can  make  use  of  the  cloud  auditing  information  to  detect 
malicious  intrusion  and  traffic  anomalies,  and  define  appropriate  filtering  rules  that 
can  be  exchanged  between  routers  in  the  network,  such  that  it  can  filter  malicious 
traffic  early  and  reroute  the  excessive  legitimate  requests  to  other  suitable  replicated 
servers.  We  first  give  an  overview  of  the  specification  and  generation  of  filtering 
rules  used  by  routers.  Then  we  present  a  theoretical  model  to  find  the  best  locations 
for  hardware  routers  in  a  network  to  block  malicious  traffic,  and  introduce  how  to 
integrate  this  theoretical  model  with  cloud  auditing  techniques.  Finally,  we  will  dis¬ 
cuss  the  experiments  conducted  to  evaluate  the  reliable  theoretical  model  and  the 
dynamic  programming  based  algorithm  for  router-based  filtering  and  rerouting  in 
cloud  computing.  The  results  validate  the  effectiveness  and  benefits  of  our  theoreti¬ 
cal  model  and  algorithm. 


1  Introduction 


Cloud  computing  has  attracted  plenty  of  attentions  and  interests  because  it  real¬ 
izes  the  long-desired  goal  of  on-demand  network  access  to  a  scalable  shared  pool 
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of  computing  resources,  such  as  servers,  storage,  applications,  platforms,  and  net¬ 
works.  However,  along  with  cloud  computing  also  come  concerns  about  security 
issues,  such  as  authentication,  integrity,  privacy,  and  availability.  In  this  chapter,  our 
main  focus  is  to  address  the  availability  issues. 

Availability  in  cloud  computing  can  be  considered  from  two  aspects:  communi¬ 
cation  channel  availability  and  cloud  resource  availability.  Communication  channel 
availability  is  gaining  more  significance  as  more  and  more  services  are  migrating  to 
the  cloud.  Since  clients  entrust  cloud  servers  to  meet  their  storage  and  computation 
needs,  it  is  important  that  clients  can  get  the  needed  service  when  they  need  it.  In 
particular,  many  real-time  cloud  applications  have  emerged  in  these  past  few  years, 
such  as  teleconference,  remote  surgery,  massively  multi-player  online  games,  and 
so  on.  Real-time  delivery  of  the  service  is  paramount  for  these  time-sensitive  ap¬ 
plications  [1].  If  the  cloud  in  question  is  a  public  cloud,  the  clients  and  the  servers 
are  not  in  the  same  domain;  in  this  case  the  Internet  is  required  to  deliver  the  ser¬ 
vice.  However,  the  reliability  of  the  Internet  is  often  impacted  by  rampant  DoS  and 
DDoS  attacks.  If  the  unwanted  packets  from  these  attacks  are  allowed  to  traverse  the 
routers  and  links  in  transit  to  their  destinations,  they  can  consume  huge  amount  of 
router  bandwidth  and  link  bandwidth,  and  disrupt  the  timely  delivery  of  legitimate 
packets.  On  the  other  hand,  cloud  resource  availability  is  no  less  important  than 
communication  channel  availability.  Cloud  service  providers  try  their  best  effort  to 
analyze  and  predict  the  pattern  of  client  needs,  and  allocate  the  cloud  resources  ac¬ 
cordingly.  However,  the  level  of  service  requests  from  clients  is  always  dynamic. 
To  ensure  that  cloud  resources  can  continue  to  meet  client  needs,  the  cloud  service 
provider  must  be  able  to  quickly  readjust  the  allocation  of  cloud  resources  when 
facing  excessive  level  of  service  requests,  or  reroute  service  requests  to  replicated 
servers  that  provide  backup  services. 

Our  goal  in  this  book  chapter  is  to  introduce  a  router-based  filtering  technology 
aimed  to  enhance  the  availability  and  performance  of  cloud  computing.  When  this 
technology  is  integrated  with  cloud  auditing  methods,  it  can  make  use  of  the  cloud 
auditing  information  to  detect  malicious  intrusion  and  traffic  anomalies,  and  define 
appropriate  filtering  rules  that  can  be  exchanged  between  routers  in  the  network, 
such  that  it  can  filter  malicious  traffic  early  and  reroute  the  excessive  legitimate 
requests  to  other  suitable  replicated  servers. 

This  chapter  is  organized  as  follows.  We  first  give  an  overview  of  the  specifica¬ 
tion  and  generation  of  filtering  rules  used  by  routers.  Then  we  will  present  a  theo¬ 
retical  model  to  find  the  best  locations  for  hardware  routers  in  a  network  to  block 
malicious  traffic,  and  introduce  how  to  integrate  this  theoretical  model  with  cloud 
auditing  techniques.  Finally,  we  will  discuss  the  experiments  conducted  to  evaluate 
the  reliable  theoretical  model  and  the  dynamic  programming  based  algorithm  for 
router-based  filtering  and  rerouting  in  cloud  computing.  Our  experimental  results 
validate  the  effectiveness  and  benefits  of  our  theoretical  model  and  algorithm. 


Improving  Cloud  Performance  with  Router-based  Filtering 


3 


Fig.  1  Router-based  filtering  and  redirection. 


2  Router-based  Filtering  and  BGP  Flow-Spec  Rules 

Routers  are  originally  designed  to  forward  packets  toward  their  specified  destina¬ 
tion  according  to  the  routing  table,  which  contains  entries  specifying  the  routing 
information  they  exchange  with  adjacent  routers.  However,  it  is  also  suitable  to  use 
routers  for  access  control  and  filtering  [4]  since  packets  need  to  pass  through  routers 
on  their  way  to  the  destination.  Compared  to  using  traditional  firewalls  installed  at 
the  end  host  to  filter  malicious  unwanted  traffic,  using  routers  for  filtering  has  the 
potential  to  filter  the  malicious  unwanted  traffic  at  a  location  closer  to  the  source, 
therefore  saving  the  bandwidth  that  would  otherwise  be  consumed  by  the  malicious 
unwanted  traffic.  An  additional  benefit  is  that  using  routers  for  filtering  allows  for 
the  realization  of  multiple  instantiations  of  dynamic  filtering  rules  because  routers 
frequently  communicate  with  neighboring  peer  routers  to  advertise  new  routes  or 
route  changes  and  update  their  routing  tables.  Routers  can  disseminate  filtering  rules 
along  with  routing  information  update  messages.  If  their  routing  tables  are  appropri¬ 
ately  updated,  routers  can  be  instructed  to  discard  malicious  packets  immediately, 
limit  the  rate  of  malicious  packets,  redirect  legitimate  packets  to  a  different  legiti¬ 
mate  server,  or  forward  them  to  a  security  center  for  further  analysis  and  diagnosis. 
Fig.  1  illustrates  the  operation  of  this  router-based  filtering  and  redirection  technol¬ 
ogy. 

In  order  to  coordinate  the  routers  in  the  Internet  to  perform  the  task  of  early  filter¬ 
ing  of  malicious  unwanted  traffic,  we  need  a  means  for  routers  to  exchange  filtering 
rules  with  adjacent  routers.  The  flow  specification  (Flow-Spec)  rules  defined  for  the 
Border  Gateway  Protocol  (BGP)  [13]  in  [9]  provide  such  a  way.  If  filtering  rules 
are  defined  as  Flow-Spec  rules,  they  can  be  passed  using  BGP  UPDATE  messages. 
Since  the  purpose  is  to  filter  the  malicious  unwanted  traffic  closer  to  the  source,  and 
BGP  is  the  de  facto  inter-domain  routing  protocol,  BGP  UPDATE  message  is  the 
appropriate  message  to  use  to  forward  the  filtering  rules  upstream  toward  malicious 
traffic  sources. 
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BGP  is  used  to  exchange  routing  information  across  different  domains  in  the 
Internet  and  is  the  protocol  used  between  ISPs.  A  domain,  or  autonomous  system 
(AS),  is  a  network  or  group  of  networks  under  a  common  administration  and  with 
common  routing  policies.  After  two  peer  BGP  routers  establish  a  connection,  they 
will  use  an  OPEN  message  to  set  up  initial  parameters,  and  exchange  their  rout¬ 
ing  table.  The  two  BGP  routers  will  send  periodical  KEEPALIVE  messages  if  there 
is  no  change  in  the  routes,  and  use  UPDATE  message  to  convey  any  route  change 
information  to  the  other  router.  BGP  UPDATE  messages  contain  a  field  called  Net¬ 
work  Layer  Reachability  Information  (NLRI).  The  Flow-Spec  rule  is  defined  as  one 
new  type  of  NLRI,  so  that  it  can  be  carried  by  BGP  Update  messages.  Formally, 
a  Flow-Spec  rule  is  an  n-tuple  consisting  of  several  matching  criteria  that  can  be 
applied  to  IP  traffic.  The  matching  criteria  of  a  Flow-Spec  are  specified  by  encoding 
various  characteristics  of  the  offending  traffic,  including  destination  prefix,  source 
prefix,  IP  protocol,  port,  destination  port,  source  port,  ICMP  type,  ICMP  code,  TCP 
flags.  Packet  length,  DSCP,  and  Fragment.  If  a  given  packet  satisfies  the  conjunction 
of  all  the  specified  criteria  of  a  defined  Flow-Spec,  it  is  considered  a  match  to  the 
Flow-Spec  and  should  be  filtered. 

A  Flow-Spec  received  from  an  external  domain  needs  to  be  validated  against 
unicast  routing  before  being  accepted  and  installed.  The  underlying  principle  of 
validation  is  to  check  that  a  BGP  router  from  a  neighboring  AS  that  advertises  a  best 
unicast  route  for  a  destination  can  only  advertise  a  Flow-Spec  rule  which  contains 
a  more  or  equally  specific  destination  prefix.  Moreover,  neighboring  domains  need 
to  establish  trust  with  each  other  that  they  will  only  advertise  valid  reachability 
information  and  will  not  advertise  an  NLRI  on  behalf  of  a  prefix  for  which  it  does 
not  provide  service;  otherwise  it  may  cause  a  denial-of-service  attack  on  that  given 
prefix.  More  details  about  the  validation  of  the  Flow-Specs  and  the  establishment  of 
trust  between  neighboring  domains  are  discussed  in  [9]. 

As  of  now,  BGP  Flow-Spec  has  received  support  from  multiple  major  vendors 
such  as  Juniper  [10]  and  Arbor.  Cisco  has  also  announced  plans  to  incorporate  BGP 
Flow-Spec  into  their  routers  [5],  Although  direct  Flow-Spec  support  is  not  yet  avail¬ 
able  in  Cisco  routers,  it  is  possible  to  use  Ciscos  existing  Access  Control  List  (ACL) 
to  emulate  a  subset  of  Flow-Specs  features.  In  our  own  research  projects,  we  have 
successfully  implemented  BGP  Flow-Spec  rules  on  a  routing  software  suite  called 
Quagga  [12],  and  have  also  developed  Tel  scripts  which  allow  Cisco  routers  to  ex¬ 
change  ACL  rules  with  other  routers  and  install  received  rules. 

In  the  following  subsections,  we  continue  to  introduce  the  filtering  actions  asso¬ 
ciated  with  Flow-Spec  rules  and  the  factors  associated  with  the  router-based  early 
filtering  approach. 


2.1  Filtering  Actions 

A  router  continues  to  compare  the  received  traffic  with  the  Flow-Spec  rules  in  its 
rule  base  according  to  their  order  in  the  rule  base,  until  a  matching  rule  is  found  or 
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the  end  of  the  rule  base  is  reached  with  no  matching  rule  being  found.  When  the 
router  finds  a  matching  Flow-Spec  rule,  i.e.  the  traffic  matches  all  the  criteria  of  the 
rule,  the  router  will  apply  the  action  specified  in  the  rule  on  the  traffic.  The  BGP 
Flow-Spec  standard  [9]  specifies  a  minimum  set  of  typical  filtering  actions  beyond 
the  default  action  in  which  the  traffic  matching  the  Flow-Spec  rule  in  question  is 
accepted.  Four  types  of  BGP  extended  community  values  are  defined  to  standardize 
the  minimum  set  of  filtering  actions,  which  are  described  as  follows. 

The  first  type  is  called  traffic-rate  extended  community,  which  is  often  used  for 
policing  applications.  The  encoding  is  6  bytes.  The  first  2  bytes  stand  for  an  AS 
number,  while  the  remaining  4  bytes  carry  rate  information  in  IEEE  floating  point 
format,  with  the  units  expressed  in  bytes  per  second.  The  traffic  in  question  is  totally 
filtered  if  the  rate  is  specified  as  0. 

The  second  type  is  called  traffic-action  extended  community,  which  consists  of  6 
bytes  but  currently  only  the  2  least  significant  bits  of  the  6th  byte  (from  left  to  right) 
are  defined.  When  the  Terminal  Action  bit  (bit  47)  is  set,  subsequent  filtering  rules 
will  be  applied  by  the  filtering  engine  if  the  current  rule  is  not  a  match.  If  this  bit  is 
not  set,  subsequent  filtering  rules  will  not  be  applied  no  matter  the  current  rule  is  a 
match  or  not.  When  the  Sampling  bit  (bit  46)  is  set,  traffic  sampling  and  logging  for 
this  flow  is  enabled. 

The  third  type  is  called  redirect  extended  community,  which  allows  the  traffic  to 
be  redirected  to  a  VRF  routing  instance.  Its  6-byte  value  encodes  the  route  target. 

The  fourth  type  is  called  traffic  marking  extended  community,  which  instructs  a 
system  to  modify  the  DSCP  bits  of  a  transiting  IP  packet  to  the  corresponding  value. 
This  extended  community  value  is  encoded  as  a  sequence  of  5  zero  bytes  followed 
by  the  DSCP  value  encoded  in  the  6  least  significant  bits  of  the  6th  byte. 

Although  the  above  extended  community  values  have  been  defined,  the  stan¬ 
dard  specification  leaves  it  open  for  each  unique  implementation  to  use  arbitrary  ex¬ 
tended  community  values  for  various  additional  filtering  actions,  as  heterogeneous 
networks  and  routing  devices  make  it  difficult  to  standardize  on  all  possible  filtering 
actions. 


2.2  Factors  of  Router-based  Filtering 

It  is  expected  that  the  router-based  early  filtering  approach  can  filter  the  malicious 
unwanted  traffic  early  and  so  can  benefit  the  legitimate  traffic  with  reduced  latency 
and  increased  throughput.  However,  the  benefits  are  not  always  guaranteed  along 
with  the  deployment.  This  is  mainly  due  to  the  cost  of  filtering:  a  routers  action  of 
filtering  packets  against  a  rule  will  incur  extra  packet  processing  overhead,  and  all 
the  packets,  no  matter  unwanted  packets  or  legitimate  packets,  need  to  be  matched 
against  the  installed  rules.  In  this  subsection,  we  discuss  the  factors  that  need  to  be 
considered  to  ensure  the  effectiveness  and  benefits  of  the  router-based  early  filtering 
approach. 
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1.  Degree  of  Rule  Dissemination:  Many  unwanted  packets  can  be  attributed  to 
DDoS  attacks  launched  by  botnets,  in  which  the  sources  are  highly  distributed. 
In  this  case,  if  a  corresponding  filtering  rule  is  disseminated  to  more  routers, 
then  there  is  a  better  chance  to  filter  more  unwanted  traffic  at  an  earlier  time. 
We  regard  the  number  of  hops  for  which  a  rule  is  disseminated  as  its  degree  of 
dissemination.  Depending  on  the  type  of  attack,  the  rule  can  be  disseminated  at 
each  hop  to  only  one  specific  upstream  router  fif  the  offending  source  is  already 
known)  or  to  every  neighboring  router  except  the  one  from  which  the  rule  is 
received.  However,  higher  degree  of  rule  dissemination  also  increases  the  overall 
filtering  overhead. 

2.  Filtering  Overhead:  Although  filtering  unwanted  traffic  early  can  save  band¬ 
width  for  more  efficient  network  utilization  by  legitimate  traffic,  this  benefit  does 
not  come  for  free.  The  action  of  filtering  itself  will  consume  the  computation  re¬ 
source  of  a  router,  which  will  slow  down  the  routers  packet  processing  to  some 
degree.  Therefore,  to  ensure  that  there  is  a  considerable  net  gain,  it  is  necessary 
to  consider  the  tradeoffs  between  the  bandwidth  saved  from  the  filtered  unwanted 
traffic  and  the  overhead  resulting  from  early  filtering.  Assume  that  the  size  of  the 
attacking  botnet  (i.e.  the  number  of  participating  bots  in  the  botnet)  is  A,  and 
each  participating  bot  contributes  traffic  of  average  bandwidth  u  to  the  victim. 
If  the  filtering  rule  is  disseminated  outbound  for  h  hops  in  a  breadth-first  man¬ 
ner,  the  overall  bandwidth  we  can  save  by  filtering  the  attack  traffic  is  roughly 
h-N  -u  (assuming  the  sources  are  at  least  h  hops  away  from  the  victim).  On  the 
other  hand,  the  use  of  filtering  rules  will  lead  to  a  slowdown  in  packet  process¬ 
ing.  Assume  that  the  rule  corresponding  to  the  attack  is  installed  at  i  routers,  and 
the  overhead  of  filtering  against  one  rule  will  result  in  an  average  reduction  of  th 
in  throughput.  Thus,  the  cost  for  achieving  the  above  saving  is  estimated  at  i  ■  th. 
The  goal  is  that  the  saving  is  larger  than  the  cost,  which  is  very  likely  considering 
the  large  size  of  botnets  and  the  ever  growing  processing  speed  and  capabilities 
of  routers. 

3.  Size  of  Rule  Cache  and  Replacement  of  an  Old  Rule:  It  is  desirable  to  limit 
the  filtering  overhead  on  each  router  to  an  acceptable  range.  Besides  limiting  the 
degree  of  rule  dissemination,  we  can  also  specify  the  maximum  number  of  rules 
each  router  installs.  The  installed  rules  can  be  maintained  in  a  rule  cache.  When 
a  new  filtering  rule  is  received  and  the  rule  cache  is  full,  the  router  has  to  make  a 
decision  about  whether  to  drop  the  newly  received  rule  or  to  replace  an  old  rule 
with  the  newly  received  rule. 

4.  Revocation  of  Obsolete  Rules:  A  filtering  rule  will  not  stay  valid  forever  be¬ 
cause  most  offending  unwanted  traffic  will  eventually  stop.  It  is  desirable  to  re¬ 
voke  obsolete  rules  in  order  to  save  filtering  overhead.  We  can  revoke  obsolete 
rules  with  two  methods.  First,  when  the  victim  of  the  unwanted  traffic  detects  that 
the  attack  has  stopped,  it  can  send  out  a  revocation  rule  to  notify  the  routers  to  re¬ 
voke  the  corresponding  obsolete  rule.  The  revocation  rule  must  be  authenticated 
to  prevent  an  attacker  from  spoofing  a  request  to  revoke  an  effective  filtering  rule. 
Second,  it  is  also  possible  to  include  a  lifetime  or  expiration  time  in  the  filtering 
rule,  so  that  the  routers  know  when  to  revoke  a  rule. 
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5.  Location  of  Legitimate  Sources:  For  a  router,  installing  a  filtering  rule  spells  for 
more  processing  overhead.  Therefore,  the  dissemination  of  filtering  rule  should 
not  continue  without  a  bound.  When  considering  how  far  should  a  filtering  rule 
be  disseminated,  one  important  observation  is  that  at  any  given  time  a  host  only 
communicates  with  a  certain  number  of  legitimate  hosts.  Assume  that  the  victim 
host  in  question  is  communicating  with  n  other  legitimate  hosts,  and  the  number 
of  hops  from  the  victim  to  each  of  these  hosts  is  hi,h2,  ■  ■  ■  ,hn.  We  can  call  the 
largest  value  among  hi,h2,---,hn  as  the  communication  radius  for  the  victim.  If 
the  filtering  rule  is  disseminated  beyond  the  victims  radius  of  communication,  the 
legitimate  sources  will  not  get  additional  benefit,  because  the  links  over  which  the 
bandwidth  is  saved  due  to  the  extra  dissemination  are  not  on  the  paths  from  the 
legitimate  sources  to  the  victim.  If  a  victim  knows  that  it  will  only  communicate 
with  legitimate  hosts  within  communication  radius  r,  then  the  victim  can  include 
r  in  the  filtering  rule  it  sends  out,  so  that  the  routers  will  not  to  disseminate  the 
rule  beyond  r  hops  from  the  victim. 


3  Selecting  Appropriate  Routers  to  Install  Filtering  Rules 

One  important  problem  that  remains  to  be  addressed  is  how  to  determine  which 
set  of  routers  in  the  network  should  be  selected  to  install  a  given  filtering  rule  of 
malicious  traffic  in  order  to  maximize  the  benefit  of  early  filtering.  There  are  at  least 
two  reasons  which  make  this  problem  important.  First,  in  our  previous  work  [8]  we 
had  shown  that  more  router  bandwidth  can  be  saved  by  forwarding  filtering  rules 
to  the  routers  closer  to  the  attack  sources.  This  is  straightforward  to  realize  if  all 
the  malicious  traffic  converges  on  the  same  single  path,  because  moving  away  from 
the  victim  on  the  path  means  moving  toward  attack  sources.  However,  in  todays 
Internet  the  attacks  and  threats  often  originate  from  very  distributed  sources,  which 
largely  complicate  this  problem.  Second,  for  consideration  of  router  efficiency  it  is 
not  possible  to  install  unlimited  number  of  filtering  rules  in  a  router  because  the 
overhead  of  matching  every  received  packet  against  every  filtering  rule  will  be  too 
large.  To  meet  this  constraint,  it  is  desirable  to  install  a  filtering  rule  only  in  a  chosen 
subset  of  routers  instead  of  in  all  routers  in  the  network. 

In  this  section,  we  formally  formulate  the  problem  of  automatic  selection  of 
routers  for  filter  placement,  and  present  an  analytical  solution  to  this  problem  which 
was  also  introduced  in  [7],  The  proposed  algorithm  is  based  on  dynamic  program¬ 
ming.  We  start  with  the  simple  case  of  tree  topology  in  which  there  is  only  one  path 
from  each  node  to  the  root,  and  then  generalize  it  to  cyclic  topology.  We  also  dis¬ 
cuss  how  to  apply  our  early  filtering  approach  and  the  QoS  based  routing  with  traffic 
distribution  in  the  networks. 

We  introduce  a  dynamic  programming  based  algorithm  to  solve  this  problem  of 
router  selection.  In  our  network  model,  we  assume  that  the  routing  paths  from  the 
distributed  attackers  to  the  target  server  (the  victim)  form  an  acyclic  tree  topology, 
in  which  there  is  only  one  path  from  each  attacker  to  the  victim.  (We  begin  our 
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discussion  with  the  tree  topology  for  the  sake  of  simplicity.  In  a  later  section,  we 
will  discuss  an  extension  of  our  model  and  algorithm  to  cover  cyclic  topologies.) 
The  root  of  the  tree  is  the  victim,  and  other  nodes  on  the  tree  represent  intermediate 
routers.  Therefore,  the  network  can  be  modeled  as  a  tree  T  =  (V,E),  where  V  is  a  set 
of  routers  and  £  is  a  set  of  links  connecting  adjacent  routers  in  V.  There  may  be  a 
number  of  hosts,  legitimate  or  malicious,  connected  to  each  router.  Malicious  hosts 
can  send  malicious  traffic  toward  the  victim.  In  our  network  model,  our  concern 
is  focused  on  the  routers,  not  the  hosts,  because  all  the  malicious  traffic  flows  are 
forwarded  by  the  routers  toward  the  victim.  Without  loss  of  generality,  we  consider 
one  routing  path  at  a  time. 

An  example  of  our  network  model  is  shown  in  Fig.  2.  In  this  example,  if  router 
A3  has  malicious  traffic  destined  to  the  victim  Rq,  then  /A,  will  forward  the  malicious 
traffic  to  R2,  which  is  the  next  router  on  the  path  from  A3  to  Rq.  Next,  A  2  will  forward 
the  malicious  traffic  received  from  A3,  along  with  the  malicious  traffic  received  from 
malicious  hosts  connected  to  A?  (if  any),  to  router  Ai.  Finally,  Ai  will  forward  the 
malicious  traffic  received  from  A?,  along  with  the  malicious  traffic  received  from 
malicious  hosts  connected  to  Ai  (if  any),  to  Ao.  The  goal  of  our  approach  is  to 
install  filtering  rules  at  a  set  of  appropriate  routers  in  the  network,  such  that  the 
malicious  traffic  flows  are  blocked  early,  soon  after  their  packets  enter  the  network. 
This  approach  has  at  least  two  significant  benefits.  First,  it  can  save  bandwidth  of 
the  routers  and  links  from  being  occupied  by  malicious  traffic,  and  make  the  saved 
bandwidth  available  for  the  use  of  legitimate  traffic.  Second,  it  can  confine  other 
possible  damages  that  could  be  caused  by  the  malicious  traffic  (e.g.,  slowing  down 
the  hosts  or  routers,  modifying  or  deleting  files  in  the  hosts  or  routers)  to  local 
regions.  Fig.  3  illustrates  the  case  in  which  some  malicious  traffic  flows  are  blocked 
at  routers  where  the  corresponding  filtering  rule  is  installed  (represented  by  a  gray 
circle).  In  this  example,  since  router  A3  installs  the  corresponding  filtering  rule,  A3 
will  block  the  malicious  traffic  it  receives.  As  such,  there  will  be  no  malicious  traffic 
flowing  from  router  A3  to  router  A2.  Router  At  does  not  installs  the  corresponding 
filtering  rule,  so  A2  will  forward  the  malicious  traffic  it  receives  toward  router  R\. 
Router  A]  installs  the  corresponding  filtering  rule,  so  Ai  will  block  the  malicious 
traffic  it  receives,  and  there  will  be  no  malicious  traffic  flowing  from  router  Ai  to 
the  victim  Rq. 

To  formulate  the  problem,  we  have  to  first  define  the  saving  and  cost  resulting 
from  filtering  a  malicious  traffic  flow  at  a  router  u.  It  is  apparent  to  see  that  there  is 
a  miss  penalty  if  one  specific  malicious  traffic  flow  is  not  filtered  at  a  given  router 
u.  The  miss  penalty  includes  at  least  the  following  two  components:  (1)  the  band¬ 
width  (of  the  link  to  the  next  router  on  the  routing  path  to  the  victim)  that  will  be 
occupied  by  the  malicious  flow  if  not  filtered,  (2)  the  cost  of  the  threat  or  damage 
caused  by  the  malicious  flow  if  not  filtered.  The  first  component  is  apparently  a  nu¬ 
meric  value  which  could  be  normalized  according  to  the  capability  of  the  router  and 
the  capacity  of  the  link,  while  the  second  component  needs  to  be  quantified  using 
some  predefined  metrics.  Let  m(u,fi)  be  the  miss  penalty  of  allowing  one  unit  of 
malicious  traffic  flow  f)  (e.g.,  one  individual  packet  of  the  malicious  flow,  or  a  set 
of  consecutive  packets  that  can  cause  damage  when  allowed  to  pass)  at  router  u.  We 
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Fig.  2  The  network  model. 


malicious 
traffic  flow 


Fig.  3  Malicious  traffic  flows  are  blocked  at  some  routers  where  the  corresponding  filtering  rule  is 
installed. 


can  define  m(u,fj)  as 

m(^fi)  =  L(Ul,u2)ePATH(u,u')C(U  1>“2.^) 

where  u'  is  the  nearest  node  on  the  path  from  u  to  the  root  that  installs  the  filtering 
rule  on  /,,  and  u\,  112  are  adjacent  nodes  on  the  path  from  u  to  u'  and  (u  \  .1,12)  £  E, 
and  c(m,U2,fi)  is  a  function  that  will  output  the  penalty  of  not  filtering  one  unit  of 
fi  on  the  link  from  u\  to  U2- 
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Next,  we  analyze  the  tradeoff  between  the  saving  and  cost  when  early  filtering  is 
exerted.  The  bandwidth  saved  at  a  router  due  to  filtering  is  related  to  the  packet  rate 
of  the  malicious  traffic  flow  arriving  at  the  router.  If  the  rate  of  a  malicious  traffic 
flow  is  higher,  then  the  bandwidth  saved  by  filtering  this  flow  is  higher.  On  the  other 
hand,  installing  a  filtering  rule  at  a  router  imposes  processing  overhead  on  the  router 
because  the  router  will  have  to  match  the  filtering  rule  against  every  received  packet 
to  determine  whether  the  packet  should  be  allowed  or  blocked.  (Even  if  a  received 
packet  is  legitimate,  it  will  be  matched  against  every  filtering  rule  so  that  it  can  be 
determined  to  be  legitimate.)  This  overhead  will  negatively  impact  the  throughput 
at  the  router.  Let  r(u,fi)  be  the  rate  of  malicious  flow  j)  observed  at  router  u,  and 
o(u,fj)  be  the  amount  of  reduced  throughput  due  to  the  overhead  of  installing  the 
filtering  rule  on  f,  at  router  u.  Then  the  saving  of  filtering  j)  at  router  u  can  be 
derived  as 

In  addition,  there  is  one  possible  source  of  performance  loss  in  the  throughput. 
We  note  that  due  to  the  overhead  of  filtering  there  should  be  an  upper  bound  on  the 
number  of  installed  filtering  rules  at  each  router,  because  if  a  router  installs  unlimited 
number  of  filtering  rules  then  it  may  have  to  spend  too  much  of  its  computation 
resources  on  filtering,  thus  leading  to  long  delay  and  low  throughput.  If  a  router 
wants  to  install  a  new  rule  but  the  number  of  installed  filtering  rules  at  the  router 
has  reached  the  upper  bound,  then  one  existing  rule  should  be  removed  in  order  to 
make  space  for  installing  this  new  rule.  We  assume  that  there  is  a  heuristic  algorithm 
available  for  a  router  to  select  the  rule  that  generates  the  least  saving  to  replace.  In 
particular,  let  l(u.  fi)  be  the  performance  loss  in  the  throughput  due  to  replacing  the 
least-saving  rule  at  router  u,  then  l(u,fi)  can  be  derived  as 


l  -I i)  —  min  every  fj  currently  filtered  at  fj)  '  m(u.  fj  )  } 

It  is  clear  that  the  rule  selected  to  be  replaced  also  imposes  processing  overhead 
on  router  u,  whose  value  is  the  same  as  o{u,fi)  because  this  new  rule  also  needs  to 
be  matched  against  every  received  packet.  Therefore,  the  net  saving  of  filtering  j)  at 
router  u  can  be  derived  as 

r(u,fi)-m{u,fi)-o(u,fi)-(l{u,fi)-o{u,fi))  =  r(u,fi)-m{u,fj)-l{u,fj) 

Now  we  can  formulate  the  problem.  For  simplicity  reason  without  loss  of  gen¬ 
erality,  we  limit  our  consideration  to  only  one  routing  path  in  the  network  and  one 
type  of  malicious  traffic.  (Note  that  other  routing  paths  in  the  network  and  other 
types  of  malicious  traffic  can  be  formulated  in  the  same  way.)  Suppose  one  routing 
path  of  the  malicious  traffic  contains  n  routers  on  the  path,  R\  through  Rn,  and  Rq 
is  the  victim  of  the  malicious  traffic.  The  filtering  rule  corresponding  to  the  given 
malicious  traffic  can  be  installed  in  up  to  k  routers,  where  0  <k<n,  on  the  routing 
path.  This  leads  to  an  optimization  problem  on  figuring  out  which  routers  (up  to  k) 
should  install  the  filtering  rule  in  order  to  maximize  the  saving. 
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Let  r,  be  the  rate  of  the  malicious  traffic  observed  at  router  Rt,  and  m,  be  the 
miss  penalty  of  the  malicious  traffic  at  router  Rj.  Since  every  intermediate  router 
may  possibly  be  connected  to  malicious  hosts  that  will  inject  more  malicious  traffic 
toward  the  victim  Rq,  we  can  get  r\  >  n>  ...  >  rn,  which  means  a  router  closer 
to  Rq  will  observe  higher  rate  of  the  malicious  traffic.  Assume  x  <  k  intermediate 
routers  RVl,  RV2,  RVx,  where  vi,  \’2,  ■  ■  ■ .  vx  are  integers  and  1  <  v\  <  V2  <  . . .  < 
vx  <  k,  are  chosen  to  install  the  filtering  rule.  Since  RVj+ \  would  have  filtered  rate 
rVj+ 1  of  the  malicious  traffic,  so  the  rate  of  malicious  traffic  to  be  observed  and 
filtered  by  Rv.  is  only  rVj  —  rVj+ 1 .  Also,  let  /,  be  the  performance  loss  due  to  replacing 
an  existing  filtering  rule  at  router  R,.  Thus  the  overall  saving  achieved  by  filtering  at 
the  aforementioned x routers  RVl ,  RVl, RVx  is  given  by  the  following  summation: 

L((rv.--rvi+ i)-mVi-lVi) 
i=  1 

Therefore,  the  problem  can  be  formally  defined  as  follows. 

Definition  1.  The  filter  placement  ^-optimization  problem:  Given  Rq.  Ri,  /L, 

which  are  nodes  on  a  tree  T  =  (V,E)  and  R,  is  the  parent  of  Rl+\  on  the  tree 
T,  and  associated  parameters  r \,  ro, . . . ,  r„,  r„+i,  where  r\  >  ri  >  . . .  >  rn  >  rn+  \  = 
0,  m i,  m2,  . . . ,  m„,  h,  h,  . .  • ,  /«,  and  k  be  an  integer  where  0  <  k  <  n.  Compute  x, 
where  0  <x<k,  and  a  set  of  x  integers  vi ,  V2, . . . ,  vx  where  1  <  vi  <  V2  <  . . .  <  vx  <  k 
(and  vo  =  0),  such  that  the  following  objective  function  is  maximized: 

X  Vi 

ASAV(Rn,k:vuv2,...,vx)  =  '£((rVi-  £  mj)-lVi) 

i=  i  ;=v,-i+i 

In  its  general  form,  ASAV ( Rind,k )  represents  the  net  saving  of  the  case  when  up 
to  k  of  routers  R\  through  Rm<i  can  install  filtering.  Also,  define  ASAV  (Rjmi.O)  =  0. 

We  propose  to  use  a  dynamic  programming  based  algorithm  to  solve  this  k- 
optimization  problem.  Dynamic  programming  is  a  well-known  method  which  solves 
complex  problems  by  breaking  it  into  smaller  overlapping  subproblems.  To  achieve 
this,  a  recurrence  is  needed  to  link  the  original  problem  with  its  subproblems. 

The  recurrence  relation  and  initial  condition  required  by  the  dynamic  program¬ 
ming  method  are  given  as  follows: 

{max(maxi<i<ind (A SAV ( Rt,k  -  1)  +  (. rind  ■  Hf£Vx+i  mj )  ~  hnd), 
ASAV (Rind-i,k)),  if  ind  >  k; 

ASAV (Rjnij,  ind ),  if  ind  <  k 

where  the  vx  in  the  first  case  is  the  last  router  (the  one  with  largest  index)  found  in 
the  solution  to  ASAV(Ri,k —  1). 
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3.1  Resolving  Cycles  in  Network  Topology 

In  the  previous  discussion,  it  is  assumed  that  the  network  topology  is  tree-shaped 
for  the  sake  of  simplicity.  However,  in  reality  a  network  topology  often  contains  one 
or  more  redundant  paths  between  two  routers,  for  several  reasons  such  as  reliability 
and  load  balancing.  The  undirected  edges  representing  these  paths  form  cycles  in 
the  topology. 

There  is  one  incompatibility  between  cyclic  topology  and  our  algorithm.  When 
the  malicious  traffic  reaches  one  router  on  the  cycle,  the  malicious  traffic  may  split 
into  two  portions,  each  following  one  different  path  toward  the  victim.  This  violates 
our  basic  assumption  that  a  router  closer  to  the  root  of  the  tree  (in  this  case  the 
victim)  will  observe  higher  rate  of  the  malicious  traffic  due  to  convergence. 

Our  algorithm  can  be  extended  with  the  following  operation  to  resolve  the  cycles 
in  the  topology.  If  a  cycle  is  found  in  the  topology,  we  can  find  the  router  at  the  split 
point  where  the  malicious  traffic  splits  into  two  paths,  and  install  the  filtering  rule 
in  that  router.  Therefore,  no  malicious  traffic  will  flow  from  the  router  at  the  split 
point  to  the  next  router  on  each  of  the  two  paths.  In  this  case,  we  can  temporarily 
remove  the  two  edges  connecting  the  router  at  the  split  point  and  the  next  router  on 
each  of  the  two  paths,  and  divide  the  topology  into  two  subgraphs:  one  subgraph  is 
rooted  at  the  victim  and  the  other  is  rooted  at  the  router  at  the  split  point.  With  this 
operation,  the  assumption  that  a  router  closer  to  the  root  will  observe  higher  rate  of 
the  malicious  traffic  is  again  satisfied.  This  operation  can  be  applied  multiple  times 
until  all  the  cycles  in  the  topology  are  resolved.  Then,  our  algorithm  can  be  applied 
on  each  subgraph  separately. 

We  can  use  the  example  topology  in  Fig.  4  to  illustrate  this  extension.  Suppose  Rq 
is  the  victim  and  the  sequence  (Rq,  /(?,  R4,  AQ  AS,  R 1,  AS)  forms  a  cycle.  When  the 
malicious  traffic  reaches  R$,  it  will  split  into  two  paths  AS  AS  AS  AS  and  AS  AS  AS  Ro¬ 
lf  the  filtering  rule  is  installed  at  AS,  then  there  will  be  no  malicious  traffic  flowing 
from  AS  to  AS  and  from  AS  to  AS.  Therefore,  we  can  divide  the  topology  into  two 
subgraphs,  one  rooted  at  Rq  and  the  other  one  rooted  at  AS,  and  each  subgraph 
satisfies  the  basic  assumption. 


3.2  Integrating  Router-based  Early  Filtering  and  Cloud  Auditing 
with  QoS  Routing  Algorithms 

It  has  been  shown  in  [8]  that  our  router-based  early  filtering  approach  can  reduce 
the  average  transmission  latency  and  increase  the  average  throughput  of  legitimate 
traffic.  The  dynamic  programming  based  filter  placement  algorithm  presented  above 
can  find  an  appropriate  set  of  routers  automatically  to  install  the  filtering  rule  to  max¬ 
imize  the  benefit  of  early  filtering.  These  improved  values  of  transmission  latency 
and  average  throughput  can  be  used  as  feedback  to  QoS  routing  algorithms.  QoS 
routing  is  a  well  studied  topic  and  was  proposed  to  provide  guaranteed  services  by 
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Fig.  4  Configuration  of  testbed  network. 


finding  a  path  satisfying  the  QoS  requirements  of  the  traffic.  There  are  existing  QoS 
routing  algorithms  based  on  dynamic  programming,  graph  theory,  and  probability, 
such  as  [3],  [6],  and  [14]. 

In  general,  QoS  routing  algorithms  depend  on  the  link  state  information,  such 
as  available  bandwidth,  transmission  latency,  loss  probability,  and  cost  function,  to 
make  routing  decisions  that  satisfy  QoS  requirements  of  specific  traffic.  Since  the 
network  environment  changes  as  time  goes  by,  the  link  state  information  which  was 
given  as  the  initial  input  to  the  QoS  routing  algorithm  may  not  persist. 

The  router-based  early  filtering  approach  and  cloud  auditing  techniques  can  be 
combined  to  improve  the  performance  of  QoS  routing  algorithms.  Cloud  auditing 
techniques  are  usually  used  to  monitor  the  service  level  and  the  usage  of  cloud 
resources,  and  analyze  the  interaction  between  cloud  servers  and  clients.  If  the  early 
filtering  approach  is  applied,  the  shorter  transmission  latency  and  higher  average 
throughput  resulting  from  the  saved  bandwidth  will  be  observed  and  recorded  by 
cloud  auditing  tools.  When  the  improved  values  of  transmission  latency  and  average 
throughput  are  provided  to  QoS  algorithms,  they  can  make  more  precise  routing 
decisions  that  utilize  the  benefits  generated  by  the  early  filtering  approach. 


4  Experimental  Results 

We  design  a  set  of  experiments  to  evaluate  the  performance  and  benefits  of  our 
approach  introduced  in  Section  3.  The  steps  we  follow  to  conduct  the  experiments 
are  summarized  as  follows: 

Step  1.  Set  up  a  testbed  network  which  is  composed  of  a  single  path  of  routers, 
one  host  (playing  the  role  of  the  distributed  attacker)  connected  to  each  router,  and 
the  destination  (the  victim). 
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Step  2.  Define  the  parameters  for  the  experiment,  including  77,  m,,  and  /,  at  router 
Ri. 

Step  3.  Inject  the  target  attacking  traffic  and  the  secondary  attacking  traffic  whose 
volume  meet  the  defined  parameters. 

Step  4.  By  default,  install  in  each  router  the  filtering  rule  corresponding  to  the 
secondary  attacking  traffic  that  directly  comes  into  each  router  instead  of  being  for¬ 
warded  by  another  router. 

Step  5.  Do  the  following  for  each  possible  combination  of  routers:  for  each  router 
in  the  chosen  combination,  replace  the  filtering  rule  on  the  secondary  attacking  traf¬ 
fic  with  the  filtering  rule  on  the  target  attacking  traffic;  generate  the  legitimate  traffic 
toward  the  destination  and  measure  the  throughput  of  the  legitimate  traffic. 

Next,  we  describe  each  step  of  our  simulation  in  detail.  In  Step  1,  with  our  cur¬ 
rently  available  equipment,  we  set  up  a  testbed  as  illustrated  in  Fig.  5.  The  configura¬ 
tion  of  our  testbed  is  composed  of  a  single  path  of  4  Cisco  2911  routers,  with  a  host 
connected  to  each  router  to  simulate  distributed  attackers.  The  attackers  are  from  the 
addresses  192.168.1.2,  192.168.2.2,  192.168.3.2,  and  192.168.4.2  as  shown  in  the 
figure.  All  the  malicious  traffic  flows  are  destined  to  the  victim  host  located  at  the 
address  192.168.5.2.  To  measure  the  performance  of  our  algorithm,  we  also  connect 
a  host  with  the  address  192.168.6.2  to  the  last  router  to  send  legitimate  traffic  to  the 
victim  host. 

In  Step  2,  we  define  the  parameters  for  the  experiment.  Since  there  are  4  routers 
on  the  path  in  the  testbed,  we  call  the  4  routers  R4 ,  AS,  /\’|  respectively,  while 

Ro  represents  the  destination.  We  define  the  relevant  parameters  7-4,  7-3,  7-3,  r\,  7774, 
7773,  7773,  mi,  h,  h,  h,  h,  for  the  simulation  example  as  shown  in  Fig.  6.  Note  that 
the  value  of  each  77  must  be  at  least  equal  to  the  value  of  77  +  1 ;  this  is  because 
router  Rj  may  receive  the  attacking  traffic  from  R;  +  1  and  also  from  the  attacker 
directly  connected  to  R,  itself.  Also  note  that  we  normalize  the  value  of  all  777,  to  1 
in  this  particular  simulation  because  all  the  attacking  traffic  is  ping  packets  and  all 
the  routers  used  in  our  testbed  are  of  the  same  model. 

In  Step  3,  we  need  to  inject  the  target  attacking  traffic  and  the  secondary  attacking 
traffic  which  meet  the  parameters  defined  in  Step  2.  The  target  attacking  traffic  sim¬ 
ulates  the  traffic  which  corresponds  to  the  new  filtering  rule,  therefore  the  volume 
of  the  target  attacking  traffic  observed  by  router  Rj  should  be  in  proportion  to  the 
7',  parameter.  The  secondary  attacking  traffic  simulates  the  traffic  filtered  by  another 
filtering  rule  which  will  be  selected  to  be  replaced  by  the  new  filtering  rule,  therefore 
the  volume  of  the  secondary  attacking  traffic  should  be  proportion  to  the  /,  param¬ 
eter.  The  target  attacking  traffic  is  simulated  by  flooding  ping  from  each  attacking 
host  toward  the  victim  host  with  multiple  threads  of  the  following  command: 

ping  -f  -s  1024  192.168.5.2  & 

Note  that  the  ping  command  is  executed  in  the  background  (by  specifying  the 
“&”  at  the  end),  so  that  we  can  execute  different  number  of  threads  of  ping  at  dif¬ 
ferent  attacking  hosts  to  simulate  different  volume  of  attacking  traffic  coming  into 
each  router. 
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Fig.  5  Configuration  of  testbed  network. 


r4=0.6  r3  =  0.8  r2  =  1.6  r1  =  2.2 

m4  =  1  m3  =  1  m2  =  1  m-L  =  1 

/4=1  /3  =  0.6  l2  =  0.8  /4=1 

Fig.  6  Parameters  used  in  the  simulation  experiments. 

Moreover,  in  order  to  simulate  namely  the  performance  loss  due  to  replacing 
an  existing  filtering  rule  at  router  we  need  to  simulate  a  secondary  attacking  traf¬ 
fic  coming  into  each  router  and  being  filtered  by  a  different  rule.  When  one  router 
installs  the  filtering  rule  on  the  target  attacking  traffic,  the  other  rule  filtering  the 
secondary  attacking  traffic  is  disabled,  so  as  to  simulate  the  performance  loss  due 
to  installing  the  new  rule.  It  would  be  ideal  if  we  can  connect  a  second  host  to 
each  router  to  generate  the  secondary  attacking  traffic,  so  that  the  filtering  of  the 
secondary  attacking  traffic  will  be  based  on  source  address,  and  the  comparison  of 
the  impact  on  legitimate  traffics  throughput  due  to  the  target  attacking  traffic  and 
the  impact  due  to  the  secondary  attacking  traffic  will  be  fair.  However,  because  of 
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the  constraint  of  our  currently  available  equipment,  we  have  to  use  an  alternative 
approach  in  which  we  distinguish  the  target  attacking  traffic  from  the  secondary  at¬ 
tacking  traffic  by  using  the  Type  of  Service  (TOS)  value  [2],  which  can  be  specified 
as  one  optional  field  in  the  ping  packet  and  can  also  be  specified  in  a  filtering  rule. 
For  example,  to  flood  ping  packets  with  TOS  value  4  toward  the  victim  192.168.5.2, 
we  can  execute  multiple  threads  of  the  following  command: 

ping  -f  -s  1024  Q  4  192.168.5.2  & 

We  generated  the  target  attacking  traffic  which  traverses  all  4  routers  as  type  0 
(the  normal  type),  the  secondary  traffic  coming  into  R  i  as  type  4,  the  secondary 
traffic  coming  into  R2  as  type  8,  the  secondary  traffic  coming  into  R3  as  type  12,  and 
the  secondary  traffic  coming  into  R4  as  type  16. 

Based  on  the  parameters  defined  in  Fig.  6,  we  execute  60  threads  of  ping  at  the 
host  192.168.4.2,  20  threads  of  ping  at  the  host  192.168.3.2,  80  threads  of  ping  at  the 
host  192.168.2.2,  and  60  threads  of  ping  at  the  host  192.168.1.2,  so  that  the  volume 
of  target  attacking  traffic  injected  into  each  router  is  proportional  to  r 4,  n,  ro,  and 
r\ .  We  also  execute  appropriate  number  of  ping  threads  with  TOS  assigned  so  that 
the  volume  of  secondary  attacking  traffic  injected  into  each  router  is  proportional  to 
I4,  (3,  h ,  and  l\. 

In  Step  4,  by  default,  each  router  installed  the  following  set  of  ACL  rules  to  filter 
the  secondary  attacking  traffic: 

access-list  100  deny  icmp  any  any  tos  X  {X  is  a  chosen 
tos  value} 

access-list  100  permit  icmp  any  any 

access-list  100  permit  ip  host  192.168.6.2  any 

If  a  router  is  chosen  to  install  the  filtering  rule  on  the  target  attacking  traffic  in 
Step  5,  the  router  will  disable  the  above  rules  and  install  the  following  set  of  ACL 
rules: 

access-list  101  deny  icmp  any  any  tos  normal 

access-list  101  permit  icmp  any  any 

access-list  101  permit  ip  host  192.168.6.2  any 

In  Step  5,  we  generate  legitimate  traffic  and  test  it  with  each  possible  combination 
of  routers  installing  the  filtering  rule  in  order  to  measure  the  throughput  of  legitimate 
traffic  of  each  case.  The  legitimate  traffic  is  generated  by  executing  iperf  [11]  client 
procedure  on  the  host  192.168.6.2  which  is  connected  to  router  R4 ,  and  executing 
iperf  server  procedure  on  the  victim  host  192.168.5.2.  The  iperf  tool  can  measure  the 
amount  of  data  transferred  between  the  client  and  the  server  during  a  given  period 
of  time,  and  calculate  the  throughput  of  the  data  path  accordingly. 

We  test  all  possible  combinations  of  routers  installing  the  filtering  rule.  There 
are  totally  15  combinations  to  test,  because  there  are  4  routers  on  the  path  and  each 
router  can  be  included  or  excluded  in  the  combination  except  for  the  case  in  which 
no  router  installs  the  filtering  rule.  The  effectiveness  of  our  algorithm  will  be  val- 
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idated  if  a  case  with  higher  net  saving  ASAV(Rjnci,k)  calculated  by  the  proposed 
algorithm  has  higher  throughput  of  legitimate  traffic.  Table  1  shows  the  solution  de¬ 
rived  from  the  dynamic  programming  based  algorithm,  including  the  best  net  saving 
and  the  set  of  routers  that  should  install  the  filtering  rule  to  get  the  best  net  saving. 
Table  2  shows  a  list  of  the  net  saving  calculated  by  the  algorithm  and  the  measured 
throughput  of  legitimate  traffic  in  the  simulation  for  all  possible  combinations  of 
routers  installing  the  filtering  rule. 


K 

0 

1 

2 

3 

4 

Ri 

0 

1.2 

{*1} 

1.2 

{«i} 

1.2 

{Ri} 

1.2 

{Ri} 

r2 

0 

2.4 

{*2} 

2.4 

m 

2.4 

{R2} 

2.4 

{R2} 

r3 

0 

2.4 

{*2} 

2.6 

{R2,Ri} 

2.6 

{R2R3} 

2.6 

{R2,Ri} 

r4 

0 

2.4 

{«2> 

2.6 

{R2,Ri} 

2.6 

{«2,«3} 

2.6 

{R2,R3} 

Table  1  Solution  to  the  simulation  example  using  the  dynamic  programming  based  algorithm. 


Note  that  the  first  row  in  Table  2,  which  states  “No  filtering”,  refers  to  the  con¬ 
trast  case  where  all  four  routers  continue  to  filter  only  the  secondary  attacking  traffic 
and  do  not  install  the  filtering  rule  against  the  target  attacking  traffic.  From  Table 
2,  it  can  be  seen  that  the  measured  throughput  is  improved  when  the  filtering  rule 
against  the  target  attacking  traffic  is  installed  on  any  subset  of  the  four  routers  in 
our  testbed.  The  measured  throughput  of  the  legitimate  traffic  is  largely  consistent 
with  the  net  saving  calculated  by  the  proposed  algorithm.  In  particular,  the  results 
show  that  the  two  cases  that  lead  to  highest  net  saving  are  installing  the  filtering  rule 
at  routers  Rj  and  R2  or  installing  at  routers  Rj  and  /?4,  which  is  consistent  with  the 
results  calculated  by  the  algorithm. 


5  Conclusion 

In  this  chapter,  we  present  a  router-based  filtering  technology  to  address  the  avail¬ 
ability  issues  in  cloud  computing.  We  first  introduced  the  BGP  Flow-Spec  rules, 
which  provide  a  handy  means  for  routers  to  exchange  information  about  early  filter¬ 
ing  of  overwhelming  malicious  traffic  that  could  impact  the  availability  of  cloud  ser¬ 
vices.  Then,  we  presented  a  theoretical  model  and  a  dynamic  programming  based  al¬ 
gorithm  to  find  the  best  routers  in  a  network  to  install  the  filtering  rule  against  given 
malicious  traffic  flow,  and  discussed  how  to  integrate  this  algorithm  with  cloud  au- 
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Combinations  of 

Net  Saving 

Throughput  Measured 

Routers  Installing 

Calculated  by 

by  Simulations 

the  Filtering  Rule 

Algorithm 

(Mb/sec) 

No  filtering 

0 

195.2 

Filtering  at  R\ 

1.2 

270.6 

Filtering  at  R2 

2.4 

303.6 

Filtering  at  R3 

1.8 

250.6 

Filtering  at  R4 

1.4 

222.2 

Filtering  at  Ri,  R2 

2 

251.9 

Filtering  at  R\,  R3 

2.2 

277.2 

Filtering  at  R\,  R4 

2 

277.2 

Filtering  at  R2 ,  R3 

2.6 

306.7 

Filtering  at  R2 ,  R4 

2.6 

304.8 

Filtering  at  R3,  R4 

1.4 

247.0 

Filtering  at  R\,  R2,  R3 

2.2 

248.3 

Filtering  at  R\,  R2,  R4 

2.2 

247.8 

Filtering  at  R\,  R3,  R4 

1.8 

275.9 

Filtering  at  R2,  R3,  R4 

2.2 

300.4 

Filtering  at  Rt ,  Rr,  R3,  R4 

1.8 

243.2 

Table  2  List  of  net  saving  calculated  by  the  algorithm  and  measured  throughput  in  the  simulation 
for  all  possible  combinations  of  routers  installing  the  filtering  rule. 


diting  techniques.  Finally,  we  used  experimental  results  to  validate  the  effectiveness 
and  benefits  of  our  theoretical  model  and  algorithm. 

Although  we  have  demonstrated  the  effectiveness  of  the  router-based  early  filter¬ 
ing  approach,  how  to  ensure  the  responsiveness  of  this  approach  remains  an  open 
challenge.  It  may  take  some  time  to  detect  an  ongoing  malicious  attack  and  gener¬ 
ate  the  corresponding  filtering  rule,  and  during  this  period  of  time  the  availability 
of  cloud  services  could  be  seriously  impacted.  It  is  worthy  to  investigate  how  to 
enhance  the  responsiveness  of  the  approach  such  that  the  attack  can  be  mitigated 
shortly  after  its  outbreak.  Moreover,  with  the  fast  proliferation  of  mobile  devices 
that  can  access  the  cloud  services,  it  is  also  important  to  manage  the  scalability  of 
the  proposed  approach. 
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26.  Abstract 

This  project  studies  how  to  combat  stealthy  botnets  and  malwares  by  exploring  a  novel  framework  called  IAPD,  which  stands 
for  “Integrated  Adaptive  and  Proactive  Defenses."  To  achieve  the  goal,  we  take  a  systems- and- theory  methodology,  meaning 
that  on  one  hand,  we  want  to  build  systems  that  can  deal  with  stealthy  attacks,  and  on  the  other  hand,  we  want  to  build  a 
theoretical  and  foundational  understanding  of  botnets.  Such  a  theoretical  understanding  allows  us  to  pave  the  way  for  achieving 
principled  modeling,  management,  and  decision-making  in  cyber  defense.  For  systems  research,  we  have  built  areal-life 
malware  behavior  system  called  Online  Malware  Analysis  System,  which  is  under  significant  further  enhancements  in  design 
and  implementation  towards  a  practical  tool.  For  theoretical  research,  we  have  been  building  mathematical  models  for 
understanding  and  reasoning  the  attack- defense  interactions  in  cyberspace.  Exciting  results  have  been  made  and  accepted  for 
publications  at  prestigious  venues.  We  have  made  substantial  technology  transfer  to  AFRL,  including  two  joint  patent 
applications  with  AFRL  researchers.  While  referring  to  the  report  for  details,  we  highlight  some  of  the  findings  as  follows.  -- 
There  are  many  bots/m alwares  that  can  evade  the  COTS  anti- malware  detection  tools.  Next  generation  cyber  defense 
technology  will  have  to  be  simultaneously  behavior- based,  adaptive  and  proactive.  We  have  made  substantial  progress  on 
these  fronts.  Our  future  studies  aim  to  systematically  characterize  the  distinguishing  features  of  malwares  based  on  our  large 
malware  database.  This  will  guide  us  to  design  the  next  generation  malware/bot  defense  technology.  --In  order  to  quantify  the 
(in)effectiveness  of  cyber  defense,  we  need  a  theoretical  foundation.  Our  studies  showed  that  for  this  purpose,  multiple 
disciplines  can  serve  as  useful  modeling  tools,  including  Stochastic  Processes,  Dynamical  Systems,  Control  Theory,  Statistical 
Physics  and  Statistics.  We  will  further  our  studies  toward  a  unified  foundation.  Our  envisioned  foundation  will  become  on 


indispensable  cornerstone  of  the  emerging  Science  of  Security.  In  particular,  we  found  that  existing 

mathematical/statistical/physical  theories  are  not  sufficient  for  cybersecurity;  instead,  we  need  to  develop  new  theories  that  are 
guided  by,  and  centered  on,  cybersecurity  problems  (i.e.,  “mathematical  theories  with  cybersecurity  meanings/significance”).  -- 
We  made  several  conceptual  advancements.  For  example,  there  are  two  classes  of  malware- based  attacks:  push- based  and 
pull-based.  Systematic  characterization  of  these  classes  of  attacks  will  deepen  our  understanding  of  cybersecurity.  For  this 
purpose,  we  introduced  some  core  abstractions  for  understanding  cybersecurity,  including  stochastic  cyber  attack  processes, 
stochastic  cyber  attack- defense  processes,  cyber  attack- defense  dynamics.  Finally,  cyber  attacks  and  cyber  attack- defense 
interactions  exhibit  rich  phenomena.  This  highlights  the  importance  of  studying  the  phenomenon-perspective  of  cybersecurity 
(or  cybersecurity  phenomena),  which  was  not  recognized  until  now.  —Our  study  touched  several  deepest  concepts  in 
cybersecurity.  For  example,  we  showed  for  the  first  time  that  cyber  attack  traffic  exhibits  the  so-called  Long-Range  Dependence, 
which  was  previously  known  to  be  exhibited  by  benign  traffic.  This  hints  that  it  is  perhaps  impossible  to  detect  (stealthy) 
attacks  based  on  traffic  alone.  This  sheds  lights  on  the  detectability  of  cyber  attacks.  On  the  other  hand,  we  showed  that  Long- 
Range  Dependence  can  be  exploited  to  predict  cyber  attacks  at  a  reasonable  accuracy.  This  sheds  light  on  the  predictability  of 
cyber  attacks.  It  would  be  of  fundamental  value  to  characterize  the  interaction  (or  trade-off)  between  detectability  and 
predictability  of  cyber  attacks. 
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